
WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // Dev, Web

I’ve been experimenting with data visualisation for a while now, most of which is for Masabi’s business plan, though I hope to share some offshoots soon.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really, really like it, but there are some quirks and setup issues that have cost me hours, so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works, Maven is a lovely tool that hides dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>WebScraping</groupId>
        <artifactId>WebScraping</artifactId>
        <packaging>jar</packaging>
        <version>0.00.01</version>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>1.6</source>
                        <target>1.6</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <repositories>
            <!-- the WSO2 repository hosts the WebHarvest jar -->
            <repository>
                <id>wso2</id>
                <url>http://dist.wso2.org/maven2/</url>
            </repository>
            <repository>
                <id>maven-repository-1</id>
                <url>http://repo1.maven.org/maven2/</url>
            </repository>
        </repositories>

        <dependencies>
            <dependency>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
                <version>1.1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.12</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.webharvest.wso2</groupId>
                <artifactId>webharvest-core</artifactId>
                <version>1.0.0.wso2v1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <!-- WebHarvest's own dependencies, which its pom does not declare -->
            <dependency>
                <groupId>net.sf.saxon</groupId>
                <artifactId>saxon-xom</artifactId>
                <version>8.7</version>
            </dependency>
            <dependency>
                <groupId>org.htmlcleaner</groupId>
                <artifactId>htmlcleaner</artifactId>
                <version>1.55</version>
            </dependency>
            <dependency>
                <groupId>bsh</groupId>
                <artifactId>bsh</artifactId>
                <version>1.3.0</version>
            </dependency>
            <dependency>
                <groupId>commons-httpclient</groupId>
                <artifactId>commons-httpclient</artifactId>
                <version>3.1</version>
            </dependency>
        </dependencies>
    </project>

You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working POM listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.
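The runner side really is only a few lines. As a minimal sketch – the config path and the name variable here are placeholders for whatever your script defines, and the fuller, more careful harness appears later in this post:

    import java.io.File;

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.Variable;

    public class MinimalScrape
    {
        public static void main(String[] args) throws Exception
        {
            // load the XML scraping definition and give the scraper a working folder
            ScraperConfiguration config = new ScraperConfiguration(new File("config/myscrape.xml"));
            Scraper scraper = new Scraper(config, "temp");
            scraper.execute();
            // anything the script stored with a var-def is now available from the context
            Variable name = (Variable) scraper.getContext().get("name");
            System.out.println(name);
        }
    }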

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.
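(If you grabbed the standalone distribution, the GUI is just the executable jar – launching it is along these lines, with the exact jar name depending on the version you downloaded:

    java -jar webharvest.jar

and the GUI opens with a sample configuration loaded.)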

As a simple example, this is a script to go to the Sony Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI.

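A sketch of that configuration, using WebHarvest’s standard http, html-to-xml, xpath and loop processors – the XPath expression matching the handset links is illustrative (the real one depends on the gallery’s markup, and is best worked out in the GUI), and handsetLinks is just an intermediate variable of my own naming:

    <config charset="UTF-8">
        <!-- fetch the gallery page and tidy it into well-formed XML -->
        <var-def name="handsetLinks">
            <xpath expression="//a[contains(@href, 'device')]/@href">
                <html-to-xml>
                    <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                </html-to-xml>
            </xpath>
        </var-def>
        <!-- store each link in its own numbered variable: uri.1, uri.2, ... -->
        <loop item="link" index="i">
            <list><var name="handsetLinks"/></list>
            <body>
                <var-def name="uri.${i}">
                    <var name="link"/>
                </var-def>
            </body>
        </loop>
    </config>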

The handset URIs will end up in a list of variables, from uri.1 to uri.N.

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which will take this URI and pull out pieces of information from the page – in this example, the region(s) that the handset was released in, along with its name and screen size.

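A sketch of that page script in the same style – the ${uri} variable is supplied at runtime by the Java harness below, the XPath expressions locating the spec cells are illustrative (again, tune them in the GUI), and page and resolution are intermediate variables of my own naming. The ([\d]*)x([\d]*) pattern shows the regular-expression step that splits a "240x320"-style screen size into separate width and height variables:

    <config charset="UTF-8">
        <!-- ${uri} is seeded into the context at runtime by the Java harness -->
        <var-def name="page">
            <html-to-xml>
                <http url="${uri}"/>
            </html-to-xml>
        </var-def>
        <!-- handset name; the XPath is illustrative -->
        <var-def name="name">
            <xpath expression="//h1[1]/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <!-- region(s) the handset was released in -->
        <var-def name="region">
            <xpath expression="//td[preceding-sibling::td[1]='Region']/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <!-- a "240x320"-style screen size, pulled apart with a regexp below -->
        <var-def name="resolution">
            <xpath expression="//td[preceding-sibling::td[1]='Screen size']/text()">
                <var name="page"/>
            </xpath>
        </var-def>
        <var-def name="screen.width">
            <regexp>
                <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
                <regexp-source><var name="resolution"/></regexp-source>
                <regexp-result><template>${_1}</template></regexp-result>
            </regexp>
        </var-def>
        <var-def name="screen.height">
            <regexp>
                <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
                <regexp-source><var name="resolution"/></regexp-source>
                <regexp-result><template>${_2}</template></regexp-result>
            </regexp>
        </var-def>
    </config>

The name, screen.width and screen.height variables are the ones the example Java code at the end of this post picks out.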

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair-tearing. In the script, this line defines the page to scrape: <http url="${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative, <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

    Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
        at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
        at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
        at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
        at scrape.ActualScraper.main(DhfScraper.java:37)
    Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  :
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.Interpreter.set(Unknown Source)
        ... 18 more

The stack trace suggests what is going on: WebHarvest pushes every context variable into the BeanShell interpreter, and BeanShell parses a compound name like url.1 as a field access on an existing url variable – hence the baffling "No such field: 1". You have been warned!
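As a rule of thumb, then: never give a scraped variable a compound name whose prefix is itself a variable in the same context. Illustrated as a hypothetical loop body:

    <!-- fine: no plain "uri" variable exists in this script's context -->
    <var-def name="uri.${i}"><var name="link"/></var-def>

    <!-- crashes if a plain "url" variable is in the context: BeanShell
         parses "url.1" as a field access on the url variable -->
    <var-def name="url.${i}"><var name="link"/></var-def>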

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

    package scrape;

    import java.io.*;
    import java.util.*;

    import org.apache.commons.logging.*;
    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.*;
    import org.webharvest.runtime.variables.Variable;

    /**
     * Quick hackable web scraping class.
     * @author Tom Godber
     */
    public abstract class QuickScraper
    {
        /** Logging object. */
        protected final Log LOG = LogFactory.getLog(getClass());
        /** Prefix for any variable scraped which defines a URL. It will be followed by a counter. */
        public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
        /** A variable name which holds the initial URL to scrape. */
        public static final String START_URL_VARIABLE = "uri";

        /** A temporary working folder. */
        private File working = new File("temp");

        /** Ensures temp folder exists. */
        public QuickScraper()
        {
            working.mkdirs();
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(new HashMap<String, Object>(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(Map<String, Object> setup, String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
        {
            return scrapeUrlList(new HashMap<String, Object>(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         * @throws NullPointerException If the setup map is null.
         */
        protected int scrapeUrlList(Map<String, Object> setup, File urlConfigXml, File pageConfigXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                // initialise any config
                setupScraperContext(setup, scraper);
                // run the script
                scraper.execute();

                // rip the URL list out of the scraped content
                ScraperContext context = scraper.getContext();
                int i=1;
                Variable scrapedUrl;
                if (LOG.isDebugEnabled())   LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
                while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i)) != null)
                {
                    if (LOG.isTraceEnabled())   LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
                    // parse this URL
                    setup.put(START_URL_VARIABLE, scrapedUrl.toString());
                    scrapeUrl(setup, pageConfigXml);
                    // move on
                    i++;
                }
                if (LOG.isDebugEnabled())   LOG.debug("No more URLs found.");
                // the counter has moved one past the last URL found
                return i-1;
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
                return -1;
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * The script must contain a hardcoded URL.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(File configXml)
        {
            scrapeUrl((String)null, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param url The URL to scrape. If null, the URL must be set in the config itself.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(String url, File configXml)
        {
            Map<String, Object> setup = new HashMap<String, Object>();
            if (url!=null)  setup.put(START_URL_VARIABLE, url);
            scrapeUrl(setup, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param setup Optional configuration for the script.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(Map<String, Object> setup, File configXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(configXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                setupScraperContext(setup, scraper);
                scraper.execute();

                // handle contents in some way
                pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

                if (LOG.isDebugEnabled())   LOG.debug("Page scraping complete.");
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * @param setup Any variables to be set before the script runs.
         * @param scraper The object which does the scraping.
         */
        private void setupScraperContext(Map<String, Object> setup, Scraper scraper)
        {
            if (setup!=null)
                for (String key : setup.keySet())
                    scraper.getContext().setVar(key, setup.get(key));
        }

        /**
         * Process a page that was scraped.
         * @param url The URL that was scraped.
         * @param context The contents of the scraped page.
         */
        public abstract void pageScraped(String url, ScraperContext context);
    }

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

    package scrape;

    import org.webharvest.runtime.ScraperContext;
    import org.webharvest.runtime.variables.Variable;

    public class ActualScraper extends QuickScraper
    {
        public static void main(String[] args)
        {
            try
            {
                ActualScraper scraper = new ActualScraper();
                // do the scraping
                scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

        /**
         * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
         */
        public void pageScraped(String url, ScraperContext context)
        {
            Variable nameVar = context.getVar("name");
            if (nameVar==null)
            {
                if (LOG.isWarnEnabled())    LOG.warn("Scrape for "+url+" produced no data! Ignoring");
                return;
            }

            // log this handset's details
            if (LOG.isInfoEnabled())    LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
        }
    }

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!