Wednesday, October 31, 2012

Java: Parsing an HTML page

I wanted to share some example code showing how to parse an HTML page using the open source library HTML Parser. I used this library in the past for a project where we needed to extract all the links of a site, and now I’m about to use it again for a project where we need to validate certain HTML coding rules. The library is simple and straightforward to use. So let’s assume we need to find all the absolute URLs referenced in a page. First, we create our parser class:

import java.io.IOException;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.Tag;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.util.ParserException;

/**
 * Parses the HTML code of the page specified by its URL.
 * @author gabriel.solano
 *
 */
public class URLHTMLParser {
 
 /*
  * Tag handler that will be used to process the tags.
  * (This could be improved by implementing an observer 
  * pattern to be able to add more than one TagHandler)
  */
 private TagHandler tagHandler;
 
 /**
  * Constructor.
  * @param tagHandler
  */
 public URLHTMLParser(TagHandler tagHandler) {
  this.tagHandler = tagHandler;
 }
 
 /**
  * Scans the specified URL.
  * @param url
  * @throws ParserException
  * @throws IOException
  */
 public void scanURL(URL url) throws ParserException, IOException {
  Lexer lexer = new Lexer(url.openConnection());
  extractHTMLNodes(lexer);
 }
 
 /**
  * Extracts the HTML nodes and lets the TagHandler do something
  * with the tags.
  * @param lexer
  * @throws ParserException
  */
 private void extractHTMLNodes(Lexer lexer) throws ParserException {
  Node node;

  while (null != (node = lexer.nextNode(false))) {  
   if (node instanceof Tag) {
    Tag tag = (Tag) node;
    tagHandler.handleTag(tag);
   }
  }
 }
}

As you can see, the last method of this class is in charge of walking through the HTML nodes. It simply lets the TagHandler do whatever is required with each tag. This is the interface for the TagHandler:
import org.htmlparser.Tag;

/**
 * Defines the interface for a TagHandler.
 * @author gabriel.solano
 *
 */
public interface TagHandler {
 
 /**
  * Handles the process of an HTML tag.
  * @param tag
  */
 public void handleTag(Tag tag);
 
}

And here’s my implementation to handle anchor tags:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Tag;

/**
 * Handles the event when an anchor tag is found while parsing 
 * HTML code of a page.
 * This class collects all the absolute URLs
 * found in the parsing process.
 * @author gabriel.solano
 *
 */
public class AnchorTagHandler implements TagHandler{

 private Set<String> absoluteURLs; // All URLs found.
 
 /**
  * Constructor.
  */
 public AnchorTagHandler() {
  absoluteURLs = new HashSet<String>();
 }
 
 /**
  * Gets the found absolute URLs. 
  * The collection is filled only during the scanning process
  * of an HTML page.
  * @return
  */
 public Set<String> getAbsoluteURLs() {
  return absoluteURLs;
 }
 
 /**
  * Handles the tag only if it is an anchor tag.
  */
 public void handleTag(Tag tag) {
  if (tag.getTagName().equalsIgnoreCase("a")) { 
   // Process only if it's an anchor tag.
   processTag(tag);
  }
 }

 /**
  * Processes the anchor tag. In this case 
  * adds all absolute URLs found.
  * @param tag
  */
 private void processTag(Tag tag) {
  String href = tag.getAttribute("href");
  
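  if (href != null) {
   // Compare the scheme case-insensitively, but store the original
   // href: URL paths can be case-sensitive.
   String lower = href.toLowerCase();
   if (lower.startsWith("http://") || lower.startsWith("https://")) {
    // Add all URLs with HTTP or HTTPS protocol.
    absoluteURLs.add(href);
   }
  }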
 }
}
The “processTag” method simply extracts the “href” attribute and checks whether it is an absolute URL. Finally, we just create a main class to run the code:
import java.net.URL;
import java.util.Set;

public class FindAbsoluteURLs {
 
 public static void main(String[] args) {
  
  AnchorTagHandler anchorTagHandler = new AnchorTagHandler();  
  URLHTMLParser htmlParser = new URLHTMLParser(anchorTagHandler);
  
  try {
   htmlParser.scanURL(new URL("http://www.crjug.org/"));
   Set<String> urls = anchorTagHandler.getAbsoluteURLs();
   
   for(String url : urls) {
    System.out.println(url);
   }
   
  } catch (Exception e) {   
   e.printStackTrace();
  } 
 }
}
Here’s the Maven dependency in case you want to use this helpful library:
<dependency>
   <groupId>org.htmlparser</groupId>
   <artifactId>htmlparser</artifactId>
   <version>1.6</version>
</dependency>
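
As a side note on the absolute URL check in “processTag”: a prefix comparison works fine for this example, but the JDK's java.net.URI can do the parsing for us. The following is only a minimal sketch and not part of the original code. The class name AbsoluteUrlCheck and the method isAbsoluteHttpUrl are hypothetical names made up for illustration, and it uses nothing beyond the standard library:

```java
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Hypothetical helper (not part of HTML Parser) that decides whether
 * an href value is an absolute HTTP or HTTPS URL.
 */
public class AbsoluteUrlCheck {

    /**
     * Returns true only if href parses as a URI that has an http or
     * https scheme and an authority (host) part.
     */
    public static boolean isAbsoluteHttpUrl(String href) {
        if (href == null) {
            return false;
        }
        try {
            URI uri = new URI(href.trim());
            String scheme = uri.getScheme();
            if (scheme == null) {
                // Relative reference such as "/about.html" or "#top".
                return false;
            }
            boolean isHttp = scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
            // Opaque URIs such as "mailto:..." have no authority.
            return isHttp && uri.getRawAuthority() != null;
        } catch (URISyntaxException e) {
            // Malformed href values are simply treated as not absolute.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteHttpUrl("http://www.crjug.org/"));
        System.out.println(isAbsoluteHttpUrl("/about.html"));
    }
}
```

With a helper like this, “processTag” could call isAbsoluteHttpUrl(href) and add the untouched href to the set. The trade-off is that malformed links are silently skipped instead of being matched by their prefix.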

Monday, October 22, 2012

Not afraid of looking uncool

Sometimes I feel that the software development world behaves, in a certain way, like the fashion world. I don’t say this because I think it is completely subject to trivial conditions, like the influence of a pop star over young teenagers, but on a small scale, the most popular rock star frameworks tend to monopolize software engineers with recipes, sometimes to the point of making them think that anything different from the recipe is out of fashion.

Frameworks like Struts, Spring or Hibernate are excellent tools for a lot of development efforts. The problem, I think, starts when a developer gets it into his head that all projects should be implemented with the standard “fits all” recipe he uses. If someone else suggests doing something different, or just mentions in a friendly conversation that he is using a different approach in an application, the framework recipe guy may look at him as “uncool”, or even mock him as the dinosaur of the team.

I've been watching the talks from the recent JavaZone 2011 conference, and two short presentations caught my attention for their courage to question the status quo. One presents a different approach to dependency injection:


Dependency injection when you only have one dependency from JavaZone on Vimeo.

And the other one, which is the one I particularly enjoyed the most, is from this young lady, who firmly explains, with good arguments, why she doesn’t like Hibernate.


Hibernate should be to programmers what cake mixes are to bakers: beneath their dignity. from JavaZone on Vimeo.

I’m not taking sides on these two presentations. I must confess that I need more experience to have an informed position on the specific topics, but I truly admire these two speakers for showing their out-of-the-box approach to software development. Innovation frequently comes from setting yourself apart from the rest.

We still have a lot to learn as young developers, and we should always be receptive to new ideas in this ever-changing business that is the software development world. Some ideas could be crap, but we need the humility to examine all of them so that we can make a rational judgment on why we discard one. Let’s not be fanatics just because everyone uses a specific tool.