
WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // Dev, Web

I’ve been experimenting with data visualisation for a while now, most of which is for Masabi’s business plan, though I hope to share some offshoots soon.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really, really like it, but there are some quirks and setup issues that have cost me hours, so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works, Maven is a lovely tool that hides dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>WebScraping</groupId>
        <artifactId>WebScraping</artifactId>
        <packaging>jar</packaging>
        <version>0.00.01</version>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>1.6</source>
                        <target>1.6</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <repositories>
            <!-- the WSO2 repository hosts the WebHarvest jar -->
            <repository>
                <id>wso2</id>
                <url>http://dist.wso2.org/maven2/</url>
            </repository>
            <repository>
                <id>maven-repository-1</id>
                <url>http://repo1.maven.org/maven2/</url>
            </repository>
        </repositories>

        <dependencies>
            <dependency>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
                <version>1.1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.12</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.webharvest.wso2</groupId>
                <artifactId>webharvest-core</artifactId>
                <version>1.0.0.wso2v1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <!-- WebHarvest's own dependencies, which its pom does not declare -->
            <dependency>
                <groupId>net.sf.saxon</groupId>
                <artifactId>saxon-xom</artifactId>
                <version>8.7</version>
            </dependency>
            <dependency>
                <groupId>org.htmlcleaner</groupId>
                <artifactId>htmlcleaner</artifactId>
                <version>1.55</version>
            </dependency>
            <dependency>
                <groupId>bsh</groupId>
                <artifactId>bsh</artifactId>
                <version>1.3.0</version>
            </dependency>
            <dependency>
                <groupId>commons-httpclient</groupId>
                <artifactId>commons-httpclient</artifactId>
                <version>3.1</version>
            </dependency>
        </dependencies>
    </project>

You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working POM listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.
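The runner side really is only a few lines. As a minimal sketch – the config path and the name variable here are placeholders for whatever your script defines, and the fuller, more careful harness appears later in this post:

    import java.io.File;

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.Variable;

    public class MinimalScrape
    {
        public static void main(String[] args) throws Exception
        {
            // load the XML scraping definition and give the scraper a working folder
            ScraperConfiguration config = new ScraperConfiguration(new File("config/myscrape.xml"));
            Scraper scraper = new Scraper(config, "temp");
            scraper.execute();
            // anything the script stored with a var-def is now available from the context
            Variable name = (Variable) scraper.getContext().get("name");
            System.out.println(name);
        }
    }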

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.
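(If you grabbed the standalone distribution, the GUI is just the executable jar – launching it is along these lines, with the exact jar name depending on the version you downloaded:

    java -jar webharvest.jar

and the GUI opens with a sample configuration loaded.)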

As a simple example, this is a script to go to the Sony Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI.

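A sketch of that configuration, using WebHarvest’s standard http, html-to-xml, xpath and loop processors – the XPath expression matching the handset links is illustrative (the real one depends on the gallery’s markup, and is best worked out in the GUI), and handsetLinks is just an intermediate variable of my own naming:

    <config charset="UTF-8">
        <!-- fetch the gallery page and tidy it into well-formed XML -->
        <var-def name="handsetLinks">
            <xpath expression="//a[contains(@href, 'device')]/@href">
                <html-to-xml>
                    <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                </html-to-xml>
            </xpath>
        </var-def>
        <!-- store each link in its own numbered variable: uri.1, uri.2, ... -->
        <loop item="link" index="i">
            <list><var name="handsetLinks"/></list>
            <body>
                <var-def name="uri.${i}">
                    <var name="link"/>
                </var-def>
            </body>
        </loop>
    </config>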

The handset URIs will end up in a list of variables, from uri.1 to uri.N.

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which will take this URI and pull out pieces of information from the page – in this example, the region(s) that the handset was released in, along with its name and screen size.

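A sketch of that page script in the same style – the ${uri} variable is supplied at runtime by the Java harness below, the XPath expressions locating the spec cells are illustrative (again, tune them in the GUI), and page and resolution are intermediate variables of my own naming. The ([\d]*)x([\d]*) pattern shows the regular-expression step that splits a "240x320"-style screen size into separate width and height variables:

    <config charset="UTF-8">
        <!-- ${uri} is seeded into the context at runtime by the Java harness -->
        <var-def name="page">
            <html-to-xml>
                <http url="${uri}"/>
            </html-to-xml>
        </var-def>
        <!-- handset name; the XPath is illustrative -->
        <var-def name="name">
            <xpath expression="//h1[1]/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <!-- region(s) the handset was released in -->
        <var-def name="region">
            <xpath expression="//td[preceding-sibling::td[1]='Region']/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <!-- a "240x320"-style screen size, pulled apart with a regexp below -->
        <var-def name="resolution">
            <xpath expression="//td[preceding-sibling::td[1]='Screen size']/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <var-def name="screen.width">
            <regexp>
                <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
                <regexp-source><var name="resolution"/></regexp-source>
                <regexp-result><template>${_1}</template></regexp-result>
            </regexp>
        </var-def>
        <var-def name="screen.height">
            <regexp>
                <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
                <regexp-source><var name="resolution"/></regexp-source>
                <regexp-result><template>${_2}</template></regexp-result>
            </regexp>
        </var-def>
    </config>

The name, screen.width and screen.height variables are the ones the example Java code at the end of this post picks out.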

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair-tearing. In the script, this line defines the page to scrape: <http url="${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative, <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

    Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
        at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
        at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
        at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
        at scrape.ActualScraper.main(DhfScraper.java:37)
    Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  :
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.Interpreter.set(Unknown Source)
        ... 18 more

The stack trace suggests what is going on: WebHarvest pushes every context variable into the BeanShell interpreter, and BeanShell parses a compound name like url.1 as a field access on an existing url variable – hence the baffling "No such field: 1". You have been warned!
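As a rule of thumb, then: never give a scraped variable a compound name whose prefix is itself a variable in the same context. Illustrated as a hypothetical loop body:

    <!-- fine: no plain "uri" variable exists in this script's context -->
    <var-def name="uri.${i}"><var name="link"/></var-def>

    <!-- crashes if a plain "url" variable is in the context: BeanShell
         parses "url.1" as a field access on the url variable -->
    <var-def name="url.${i}"><var name="link"/></var-def>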

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

    package scrape;

    import java.io.*;
    import java.util.*;

    import org.apache.commons.logging.*;
    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.*;
    import org.webharvest.runtime.variables.Variable;

    /**
     * Quick hackable web scraping class.
     * @author Tom Godber
     */
    public abstract class QuickScraper
    {
        /** Logging object. */
        protected final Log LOG = LogFactory.getLog(getClass());
        /** Prefix for any variable scraped which defines a URL. It will be followed by a counter. */
        public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
        /** A variable name which holds the initial URL to scrape. */
        public static final String START_URL_VARIABLE = "uri";

        /** A temporary working folder. */
        private File working = new File("temp");

        /** Ensures temp folder exists. */
        public QuickScraper()
        {
            working.mkdirs();
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(new HashMap<String, Object>(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(Map<String, Object> setup, String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
        {
            return scrapeUrlList(new HashMap<String, Object>(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         * @throws NullPointerException If the setup map is null.
         */
        protected int scrapeUrlList(Map<String, Object> setup, File urlConfigXml, File pageConfigXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                // initialise any config
                setupScraperContext(setup, scraper);
                // run the script
                scraper.execute();

                // rip the URL list out of the scraped content
                ScraperContext context = scraper.getContext();
                int i=1;
                Variable scrapedUrl;
                if (LOG.isDebugEnabled())   LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
                while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i)) != null)
                {
                    if (LOG.isTraceEnabled())   LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
                    // parse this URL
                    setup.put(START_URL_VARIABLE, scrapedUrl.toString());
                    scrapeUrl(setup, pageConfigXml);
                    // move on
                    i++;
                }
                if (LOG.isDebugEnabled())   LOG.debug("No more URLs found.");
                // the counter has moved one past the last URL found
                return i-1;
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
                return -1;
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * The script must contain a hardcoded URL.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(File configXml)
        {
            scrapeUrl((String)null, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param url The URL to scrape. If null, the URL must be set in the config itself.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(String url, File configXml)
        {
            Map<String, Object> setup = new HashMap<String, Object>();
            if (url!=null)  setup.put(START_URL_VARIABLE, url);
            scrapeUrl(setup, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param setup Optional configuration for the script.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(Map<String, Object> setup, File configXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(configXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                setupScraperContext(setup, scraper);
                scraper.execute();

                // handle contents in some way
                pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

                if (LOG.isDebugEnabled())   LOG.debug("Page scraping complete.");
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * @param setup Any variables to be set before the script runs.
         * @param scraper The object which does the scraping.
         */
        private void setupScraperContext(Map<String, Object> setup, Scraper scraper)
        {
            if (setup!=null)
                for (String key : setup.keySet())
                    scraper.getContext().setVar(key, setup.get(key));
        }

        /**
         * Process a page that was scraped.
         * @param url The URL that was scraped.
         * @param context The contents of the scraped page.
         */
        public abstract void pageScraped(String url, ScraperContext context);
    }

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

    package scrape;

    import org.webharvest.runtime.ScraperContext;
    import org.webharvest.runtime.variables.Variable;

    public class ActualScraper extends QuickScraper
    {
        public static void main(String[] args)
        {
            try
            {
                ActualScraper scraper = new ActualScraper();
                // do the scraping
                scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

        /**
         * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
         */
        public void pageScraped(String url, ScraperContext context)
        {
            Variable nameVar = context.getVar("name");
            if (nameVar==null)
            {
                if (LOG.isWarnEnabled())    LOG.warn("Scrape for "+url+" produced no data! Ignoring");
                return;
            }

            // log this handset's details
            if (LOG.isInfoEnabled())    LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
        }
    }

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!