Saturday, June 28, 2014

Simple Web Crawler with crawler4j

A web crawler is an Internet bot that systematically browses the web. The purpose of this browsing may vary; typically it is web indexing or collecting a specific kind of content. The typical approach is to start with a few seed URLs: the crawler visits every link in the initial list, adds the links it finds on those pages back to the list, and keeps going.

crawler4j is a Java library that greatly simplifies the process of creating a web crawler. The crawler4j library and its dependencies can be downloaded from the page below.

https://code.google.com/p/crawler4j/downloads/list

Basically you have to create the crawler by extending the WebCrawler class and then create a controller for it. In the crawler you have to override two basic methods:

  • shouldVisit - this method is called for each discovered URL to decide whether it should be visited or not.
  • visit - this method is called after the contents of a given URL have been downloaded successfully. You can easily access the URL and the contents of the page from this method.
Below is a simple implementation of the shouldVisit method. It accepts only pages in the same domain as the added seed and skips CSS, JS and media files. First you can create a pattern that matches the file types to be skipped.



Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

Now we can override the shouldVisit method.


@Override
public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !filters.matcher(href).matches() && href.startsWith("http://www.lankadeepa.lk/");
}
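
As a quick sanity check, you can run the same pattern against a couple of URLs outside the crawler and see which ones would be skipped. This is just an illustration; the class name FilterTest and the sample URLs are made up.

import java.util.regex.Pattern;

public class FilterTest {

        public static void main(String[] args) {
                Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                                + "|png|tiff?|mid|mp2|mp3|mp4"
                                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

                // an article page: not matched by the filter, so shouldVisit would return true
                System.out.println(filters.matcher("http://www.lankadeepa.lk/index.php/news/1").matches());
                // an image file: matched by the filter, so shouldVisit would return false
                System.out.println(filters.matcher("http://www.lankadeepa.lk/images/logo.png").matches());
        }

}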

Next, you can override the visit method and print some details of the visited pages.


@Override
public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("Visited: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                String text = htmlParseData.getText();
                String html = htmlParseData.getHtml();
                List<WebURL> links = htmlParseData.getOutgoingUrls();

                System.out.println("Text length: " + text.length());
                System.out.println("Html length: " + html.length());
                System.out.println("Number of outgoing links: " + links.size());
        }

}
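
Note that both overrides, together with the filters pattern, belong in a single class that extends WebCrawler. Below is a minimal sketch of that class; the name Crawler is an assumption here, chosen only to match the Crawler.class reference in the controller that follows.

import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {

        // pattern from above, used by shouldVisit to skip style sheets, scripts and media files
        private final static Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                        + "|png|tiff?|mid|mp2|mp3|mp4"
                        + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                        + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // the shouldVisit and visit methods shown above go here
}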

Now you have to create the controller class.


import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

        public static void main(String[] args) throws Exception {
                String rootFolder = "data/crawler";
                int numberOfCrawlers = 1;

                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder(rootFolder);
                config.setMaxPagesToFetch(4);
                config.setPolitenessDelay(1000);
                config.setMaxDepthOfCrawling(10);
                config.setProxyHost("cache.mrt.ac.lk");
                config.setProxyPort(3128);

                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
                RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

                controller.addSeed("http://www.lankadeepa.lk/");
                controller.start(Crawler.class, numberOfCrawlers);
        }

}

From the controller class you can control the number of crawlers created, the maximum number of pages to fetch and the maximum depth of crawling, and add proxy settings if needed. This is also where you add the seeds of the crawl. The seeds form the initial list of URLs; as pages are visited, the links found on them are added to the list, so the list keeps growing.

Now you can start crawling. If you want to access any details of the crawled pages, that is easily done in the visit method. As you can see, it gives you all the details of the web page, including its HTML. If you want to extract any data from a page, this is the place to do it.
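
For example, if you wanted to keep the raw HTML of every visited page, you could add a few lines like the following inside the visit method. This is only a sketch: the data/pages output folder is an arbitrary choice, and it relies on the docid that crawler4j assigns to each URL.

// inside visit(Page page); also needs java.io.IOException, java.nio.charset.StandardCharsets,
// java.nio.file.Files and java.nio.file.Paths imports
if (page.getParseData() instanceof HtmlParseData) {
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        String fileName = "data/pages/" + page.getWebURL().getDocid() + ".html";
        try {
                Files.createDirectories(Paths.get("data/pages"));
                Files.write(Paths.get(fileName), html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
                e.printStackTrace();
        }
}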