Help Needed for Web Scraping
Tuesday, February 20th 2007 | Ismael Ghalimi
The good folks at Zoho recently added a new feature to Zoho Sheet that allows external tabular data to be dynamically embedded within a spreadsheet, then syndicated back through JSON and RSS feeds. It’s pretty cool really, for it allows one to perform some complex web scraping, without having to install any piece of software. Nevertheless, it only works with tabular data, and most data I would like to scrape from the web tends to be found outside of tables. Would anyone know about some online service that I could use for such a purpose?
Here is a scenario for it. As I am compiling my Weekly Office 2.0 Roundup, I need to gather data from Alexa and Google. For the latter, I use the free Page Rank Checker. As I create a Top 10 list of services, I need to make 10 requests to Alexa, and 10 others to Page Rank Checker. Pretty boring, if you ask me. What I am looking for is a service that would allow me to create a JSON or RSS feed for any piece of data contained within any publicly-accessible web page. For example, I would like to fetch the Alexa Rank for www.intalio.com, which today is 67,106.
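To make the scenario concrete, here is a minimal sketch in Python of the kind of transformation I have in mind: pull one number out of a page and publish it as JSON. The markup in SAMPLE_HTML is made up for illustration (I am not claiming this is what the Alexa page actually looks like), and the function names extract_rank and to_json_feed are hypothetical. A real service would also have to fetch the page, which this sketch leaves out.

```python
import json
import re

# Hypothetical page fragment, invented for this example.
# The real markup of the rank page is an assumption and may differ.
SAMPLE_HTML = '<div class="traffic">Traffic Rank for intalio.com: <b>67,106</b></div>'


def extract_rank(html):
    """Return the first number that follows the words 'Traffic Rank'."""
    match = re.search(r"Traffic Rank[^0-9]*([0-9][0-9,]*)", html)
    if match is None:
        raise ValueError("no rank found in page")
    # Strip the thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))


def to_json_feed(site, rank):
    """Serialize one scraped value as a small JSON document."""
    return json.dumps({"site": site, "alexa_rank": rank})


print(to_json_feed("www.intalio.com", extract_rank(SAMPLE_HTML)))
```

Doing this by hand for 20 pages every week is exactly what I would like an online service to take off my plate.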
The reader who finds the best solution will get a free IT|Redux branded iPod nano.
Entry filed under: Office 2.0
Well, for the Alexa information, most of it is available through the Alexa Web Information Service provided by Amazon. You would have to write a simple script that would pull this, and then output as JSON or RSS.
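As a rough illustration of the "output as RSS" half of such a script, here is a sketch in Python using only the standard library. It assumes the rank values have already been retrieved somehow (the call to the Alexa service itself is not shown and is not modeled here). The function name ranks_to_rss, the channel title, and the example.com link are placeholders, not part of any real service.

```python
import xml.etree.ElementTree as ET


def ranks_to_rss(ranks):
    """Build a minimal RSS 2.0 document with one item per site.

    `ranks` maps a site name to its rank; how those values were
    obtained is outside the scope of this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Placeholder channel metadata (assumed values).
    ET.SubElement(channel, "title").text = "Site ranks"
    ET.SubElement(channel, "link").text = "http://example.com/ranks"
    ET.SubElement(channel, "description").text = "Ranks collected by a script"
    for site, rank in ranks.items():
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = site
        ET.SubElement(item, "description").text = str(rank)
    return ET.tostring(rss, encoding="unicode")


print(ranks_to_rss({"www.intalio.com": 67106}))
```

Any feed reader or spreadsheet that understands RSS could then consume the result.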
Alternatively, and more generically, you could use a service like Dapper that will walk you through a wizard and scrape information off of pages, then output it in many different formats. It’s definitely worth a try, but I didn’t have any luck with the Alexa pages.
Another tool that is new is OpenKapow. I have not tried this one, but it looks promising based on the sample content out there.
I’m guessing you want something like Dapper, although it seems to be down now.
Take a look at scRUBYt! This is not necessarily the tool you may be looking for, but I found it pretty interesting nonetheless.
Congratulations on the new addition to the family.
-Ryan
Take a peek at one of these services: OpenKapow, Ponyfish, screen-scraper.
Check out the ‘Content Analysis’ functionality in Yahoo! Pipes. This pipe that mashes up the New York Times front page with Flickr provides a good example.
Nanek,
I could use Alexa’s web service indeed, but I need one for Google too.
Dapper looks pretty cool indeed. I’ll give it a shot.
OpenKapow seems to require the installation of software, so it’s not an option.
Best regards
-Ismael
Assaf,
Thanks for the tip. Looks good indeed.
Best regards
-Ismael
Mike,
OpenKapow and screen-scraper are software based. Not an option.
I will give Ponyfish a shot though.
Best regards
-Ismael
Cory,
That’s a great idea! I’ll give it a shot.
Best regards
-Ismael
OpenKapow only requires software to build the APIs. Once they are built, anyone on the web can access them through the OpenKapow service, without any plugin or software. Thus, only you, as the builder, need to download the software; nobody else needs it. What you download from OpenKapow is basically a power web browser with built-in visual scripting that allows you to build REST, JSON, RSS, and ATOM services and deploy them on OpenKapow. Doing what you need should take only 15 minutes once you have gone through the OpenKapow tutorial and downloaded the software.
-Stefan
Stefan,
Thanks for the clarification. I’ll give it a try then.
Any plans to develop a version that would not require any software at all?
Best regards
-Ismael
Clipmarks looks interesting as well.
Mike,
Thanks for sharing. I’ll take a look.
Best regards
-Ismael
Need web scraping? You name the website, we scrape the data for you.
Hi Ismael,
As a result of reading this blog entry and trying out some of the services mentioned in the comments, I decided to go ahead and create my own “simple” solution… You can read about it here.
Mike,
I will give it a shot!
-Ismael