IT|Redux

Help Needed for Web Scraping

Tuesday, February 20th 2007 | Ismael Ghalimi

The good folks at Zoho recently added a new feature to Zoho Sheet that allows external tabular data to be dynamically embedded within a spreadsheet, then syndicated back through JSON and RSS feeds. It’s pretty cool really, for it allows one to perform some complex web scraping, without having to install any piece of software. Nevertheless, it only works with tabular data, and most data I would like to scrape from the web tends to be found outside of tables. Would anyone know about some online service that I could use for such a purpose?

Here is a scenario for it. As I am compiling my Weekly Office 2.0 Roundup, I need to gather data from Alexa and Google. For the latter, I use the free Page Rank Checker. As I create a Top 10 list of services, I need to make 10 requests to Alexa, and 10 more to Page Rank Checker. Pretty boring, if you ask me. What I am looking for is a service that would allow me to create a JSON or RSS feed for any piece of data contained within any publicly accessible web page. For example, I would like to fetch the Alexa Rank for www.intalio.com, which today is 67,106.
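To make the request concrete, here is a minimal Python sketch of the kind of transformation such a service would perform: pull one value out of a page and expose it as JSON. Everything specific in it is an assumption made for illustration. The sample markup is hypothetical and is not the real structure of any Alexa page, and the helper names extract_field and to_json_feed are invented here. It also works on a hard-coded string rather than fetching a live page, and a regular expression is a brittle way to read HTML, so treat this as a sketch rather than a working scraper.

```python
import json
import re


def extract_field(html, pattern):
    """Return the first capture group of `pattern` found in `html`, or None."""
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1).strip() if match else None


def to_json_feed(source, fields):
    """Serialize the scraped fields as a small JSON document."""
    return json.dumps({"source": source, "data": fields}, sort_keys=True)


# Hypothetical markup, invented for this example only.
SAMPLE_HTML = '<div><span id="rank">67,106</span></div>'

# Hypothetical pattern matching the invented markup above.
RANK_PATTERN = r'<span id="rank">(.*?)</span>'

rank_text = extract_field(SAMPLE_HTML, RANK_PATTERN)
# Strip the thousands separator before converting to an integer.
rank = int(rank_text.replace(",", "")) if rank_text else None

feed = to_json_feed("www.intalio.com", {"rank": rank})
print(feed)
```

Repeating the same extraction for each of the ten services and the two data sources would replace the twenty manual requests described above with a single feed.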

The reader who finds the best solution will get a free IT|Redux branded iPod nano.

Entry filed under: Office 2.0

16 Comments

1. Nanek  |  February 21st, 2007 at 7:31 pm

Well, for the Alexa information, most of it is available through the Alexa Web Information Service provided by Amazon. You would have to write a simple script that would pull this, and then output as JSON or RSS.

Alternatively, and more generically, you could use a service like Dapper that will walk you through a wizard and scrape information off of pages, then output it in many different formats. It’s definitely worth a try, but I didn’t have any luck with the Alexa pages.

Another tool that is new is OpenKapow. I have not tried this one, but it looks promising based on the sample content out there.

2. Assaf  |  February 21st, 2007 at 10:18 pm

I’m guessing you want something like Dapper, although it seems to be down now.

3. Ryan Armasu  |  February 21st, 2007 at 11:04 pm

Take a look at scRUBYt! This is not necessarily the tool you may be looking for, but I found it pretty interesting nonetheless.

Congratulations on the new addition to the family. -Ryan

4. Mike Parsons  |  February 22nd, 2007 at 3:25 am

Take a peek at one of these services: OpenKapow, Ponyfish, screen-scraper.

5. Cory  |  February 22nd, 2007 at 6:07 am

Check out the ‘Content Analysis’ functionality in Yahoo! Pipes. This pipe that mashes up the New York Times front page with Flickr provides a good example.

6. Ismael Ghalimi  |  February 22nd, 2007 at 2:20 pm

Nanek,

I could use Alexa’s web service indeed, but I need one for Google too.

Dapper looks pretty cool indeed. I’ll give it a shot.

OpenKapow seems to require the installation of software, so it’s not an option.

Best regards -Ismael

7. Ismael Ghalimi  |  February 22nd, 2007 at 2:20 pm

Assaf,

Thanks for the tip. Looks good indeed.

Best regards -Ismael

8. Ismael Ghalimi  |  February 22nd, 2007 at 2:21 pm

Mike,

OpenKapow and screen-scraper are software based. Not an option.

I will give Ponyfish a shot though.

Best regards -Ismael

9. Ismael Ghalimi  |  February 22nd, 2007 at 2:22 pm

Cory,

That’s a great idea! I’ll give it a shot.

Best regards -Ismael

10. Stefan Andreasen  |  February 22nd, 2007 at 3:16 pm

OpenKapow only requires software to build the APIs. Once they are built, anyone on the web can access them through the OpenKapow service, without any plugin or software. Only the person building an API needs to download the software; nobody else does. What you download from OpenKapow is basically a power web browser with built-in visual scripting that lets you build REST, JSON, RSS, and Atom services and deploy them on OpenKapow. Doing what you need should take only 15 minutes once you have gone through the OpenKapow tutorial and downloaded the software. -Stefan

11. Ismael Ghalimi  |  February 22nd, 2007 at 3:32 pm

Stefan,

Thanks for the clarification. I’ll give it a try then.

Any plans to develop a version that would not require any software at all?

Best regards -Ismael

12. Mike Parsons  |  February 27th, 2007 at 6:37 am

Clipmarks looks interesting as well.

13. Ismael Ghalimi  |  February 28th, 2007 at 3:44 pm

Mike,

Thanks for sharing. I’ll take a look.

Best regards -Ismael

14. Jack Pipe  |  March 7th, 2007 at 11:45 am

Need web scraping? You name the website, we scrape the data for you.

15. Mike Parsons  |  March 21st, 2007 at 12:14 pm

Hi Ismael,

As a result of reading this blog entry and trying out some of the services mentioned in the comments, I decided to go ahead and create my own “simple” solution… You can read about it here.

16. Ismael Ghalimi  |  March 21st, 2007 at 1:09 pm

Mike,

I will give it a shot! -Ismael
