Wednesday, October 31, 2012

Java: Parsing an HTML page

I wanted to share some example code showing how to parse an HTML page using the open-source library HTML Parser. I used this library in the past for a project where we had to extract all the links of a site, and now I’m about to use it again for a project where we need to validate certain HTML coding rules. The library is simple and straightforward to use. So let’s assume we need to find all the absolute URLs referenced in a page. First, we create our parser class:

import java.io.IOException;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.Tag;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.util.ParserException;

/**
 * Parses the HTML code of the page specified by its URL.
 * @author gabriel.solano
 *
 */
public class URLHTMLParser {
 
 /*
  * Tag handler that will be used to process the tags.
  * (This could be improved by implementing an observer 
  * pattern to be able to add more than one TagHandler)
  */
 private TagHandler tagHandler;
 
 /**
  * Constructor.
  * @param tagHandler
  */
 public URLHTMLParser(TagHandler tagHandler) {
  this.tagHandler = tagHandler;
 }
 
 /**
  * Scans the specified URL.
  * @param url
  * @throws ParserException
  * @throws IOException
  */
 public void scanURL(URL url) throws ParserException, IOException {
  Lexer lexer = new Lexer(url.openConnection());
  extractHTMLNodes(lexer);
 }
 
 /**
  * Extracts the HTML nodes and lets the TagHandler do something
  * with the tags.
  * @param lexer
  * @throws ParserException
  */
 private void extractHTMLNodes(Lexer lexer) throws ParserException {
  Node node;

  while (null != (node = lexer.nextNode(false))) {  
   if (node instanceof Tag) {
    Tag tag = (Tag) node;
    tagHandler.handleTag(tag);
   }
  }
 }
}

As you can see, the last method of this class is in charge of iterating over the HTML nodes. It simply lets the TagHandler do whatever is required with each tag. This is the interface for the TagHandler:

import org.htmlparser.Tag;

/**
 * Defines the interface for a TagHandler.
 * @author gabriel.solano
 *
 */
public interface TagHandler {
 
 /**
  * Handles the process of an HTML tag.
  * @param tag
  */
 public void handleTag(Tag tag);
 
}
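
The comment in URLHTMLParser mentions that an observer-style design would allow more than one TagHandler. Here is a minimal sketch of that idea. To keep it compilable without the HTML Parser library on the classpath, it uses a hypothetical generic Handler<T> interface as a stand-in for TagHandler, and CompositeHandler is likewise a name I made up for illustration. With the real library you would presumably implement TagHandler and take a Tag instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TagHandler, generic so the sketch compiles
// without the HTML Parser library on the classpath.
interface Handler<T> {
    void handle(T item);
}

// Forwards each item to every registered handler, in registration order.
final class CompositeHandler<T> implements Handler<T> {

    private final List<Handler<T>> handlers = new ArrayList<>();

    // Registers a handler and returns this to allow chained calls.
    CompositeHandler<T> add(Handler<T> handler) {
        handlers.add(handler);
        return this;
    }

    @Override
    public void handle(T item) {
        for (Handler<T> handler : handlers) {
            handler.handle(item);
        }
    }
}
```

Because the composite is itself a handler, the parser class would not need to change: it would still receive a single handler in its constructor, and that handler would fan each tag out to the others.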

And here’s my implementation to handle anchor tags:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Tag;

/**
 * Handles the event when an anchor tag is found while parsing 
 * HTML code of a page.
 * This class collects all absolute URLs
 * found during the parsing process.
 * @author gabriel.solano
 *
 */
public class AnchorTagHandler implements TagHandler{

 private Set<String> absoluteURLs; // All URLs found.
 
 /**
  * Constructor.
  */
 public AnchorTagHandler() {
  absoluteURLs = new HashSet<String>();
 }
 
 /**
  * Gets the found absolute URLs. 
  * The collection is filled only during the scanning process
  * of an HTML page.
  * @return
  */
 public Set<String> getAbsoluteURLs() {
  return absoluteURLs;
 }
 
 /**
  * Handles the tag only if it is an anchor tag.
  */
 public void handleTag(Tag tag) {
  if (tag.getTagName().equalsIgnoreCase("a")) { 
   // Process only if it's an anchor tag.
   processTag(tag);
  }
 }

 /**
  * Processes the anchor tag. In this case 
  * adds all absolute URLs found.
  * @param tag
  */
 private void processTag(Tag tag) {
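  String href = tag.getAttribute("href");
  
  if (href != null) {
   // Lowercase a copy for the scheme check only; paths can be case-sensitive.
   String lower = href.toLowerCase();
   if (lower.startsWith("http://") || lower.startsWith("https://")) {
    // Add all URLs with HTTP or HTTPS protocol, keeping the original case.
    absoluteURLs.add(href);
   }
  }  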
 }
}
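
The scheme check in processTag is a plain prefix test. As an alternative, here is a rough sketch that uses only the JDK's java.net.URI to decide whether an href is an absolute HTTP or HTTPS URL. The class name AbsoluteUrlCheck and the method isAbsoluteHttpUrl are names I made up for illustration and are not part of HTML Parser. Note that this check is stricter than the prefix test: it also rejects values that do not parse as a URI or that have no host.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AbsoluteUrlCheck {

    // Returns true when href parses as a URI whose scheme is http or https
    // and that names a host. The scheme is compared case-insensitively and
    // href itself is never modified.
    static boolean isAbsoluteHttpUrl(String href) {
        if (href == null) {
            return false;
        }
        try {
            URI uri = new URI(href.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Malformed values (for example, ones containing raw spaces) are skipped.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteHttpUrl("HTTP://example.com/Some/Path"));
        System.out.println(isAbsoluteHttpUrl("/relative/path"));
        System.out.println(isAbsoluteHttpUrl("mailto:someone@example.com"));
    }
}
```

A handler could call this method instead of the two startsWith checks and add the untouched href to its set when it returns true.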

The “processTag” method simply extracts the “href” attribute and checks whether it is an absolute URL. Finally, we just create a main class to run the code:

import java.net.URL;
import java.util.Set;

public class FindAbsoluteURLs {
 
 public static void main(String[] args) {
  
  AnchorTagHandler anchorTagHandler = new AnchorTagHandler();  
  URLHTMLParser htmlParser = new  URLHTMLParser(anchorTagHandler);
  
  try {
   htmlParser.scanURL(new URL("http://www.crjug.org/"));
   Set<String> urls = anchorTagHandler.getAbsoluteURLs();
   
   for(String url : urls) {
    System.out.println(url);
   }
   
  } catch (Exception e) {   
   e.printStackTrace();
  } 
 }
}

Here’s the Maven dependency in case you need to use this helpful library:

<dependency>
   <groupId>org.htmlparser</groupId>
   <artifactId>htmlparser</artifactId>
   <version>1.6</version>
</dependency>
