Friday, December 26, 2014

A web scraper/crawler in Java: Krawkraw

UPDATE: Krawkraw is now referred to as just Webmuncher.

Nobody sets out to write a web scraper, but here I am posting about just that: a general-purpose web scraper I wrote…

How did this happen? Well, it started with wanting to play around with ElasticSearch a couple of months back. And in search of some data to index, I looked to websites' contents (with hindsight a more structured data format should have been sought, but well, the lesson has been learnt)...

So the question then was: how do I easily retrieve all the HTML pages of a site and have them thrown into ElasticSearch? Was there any easy web scraper in Java out there that I could just use? I can't even remember actively searching for such a tool...which explains why I somehow decided it would be much more fulfilling to throw a web scraper together myself...and in a couple of weekends, Krawkraw came to be.

Krawkraw is a tool that can be used to easily retrieve all the contents of a website or, more accurately, all the contents under a single domain. That is its ideal use case, and it reflects the original need for which it was written. So you can't start at one edge of the web with Krawkraw and expect to crawl to another edge...Nope.
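To make the "single domain" idea a bit more concrete, here is a minimal sketch of a crawl that refuses to leave the site it started on. This is not Krawkraw's actual code or API: the class `SameHostCrawlSketch`, its methods, and the in-memory `linksByPage` map (which stands in for fetching a page and extracting its links) are all made up for illustration, and "same domain" is simplified to "same host".

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Krawkraw.
public class SameHostCrawlSketch {

    // True when both URLs point at the same host (compared case-insensitively).
    static boolean sameHost(String startUrl, String candidateUrl) {
        String startHost = URI.create(startUrl).getHost();
        String candidateHost = URI.create(candidateUrl).getHost();
        return startHost != null && startHost.equalsIgnoreCase(candidateHost);
    }

    // Breadth-first walk that only follows links staying on the start host.
    // linksByPage is an assumed stand-in for "download page, extract links".
    static List<String> crawl(String startUrl, Map<String, List<String>> linksByPage) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (!visited.add(page)) {
                continue; // already crawled this page
            }
            List<String> links = linksByPage.getOrDefault(page, Collections.<String>emptyList());
            for (String link : links) {
                // Links to other hosts are dropped, so the crawl never leaves the site.
                if (sameHost(startUrl, link) && !visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("http://example.com/",
                Arrays.asList("http://example.com/about", "http://other.org/"));
        links.put("http://example.com/about",
                Arrays.asList("http://example.com/", "http://example.com/contact"));
        // prints [http://example.com/, http://example.com/about, http://example.com/contact]
        System.out.println(crawl("http://example.com/", links));
    }
}
```

The link to `other.org` is seen but never queued, which is the whole point: the crawl is bounded by the site it started from.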

Krawkraw is available via Maven Central, and you can easily drop it into your project with these coordinates:


<dependency>
    <groupId>com.blogspot.geekabyte.krawkraw</groupId>
    <artifactId>krawler</artifactId>
    <version>${krawkraw.version}</version>
</dependency>


Or you can download the jar files here and include them in your classpath.

For more information on the API and its usage, the README should be your friend! If you happen to use Krawkraw and find it missing some features, or you have ideas for features it should have, then drop them in the issue tracker.
