A life spent making mistakes is not only honorable, but also more useful than a life spent doing nothing.

Web bots and their use

Posted: March 1st, 2010 | Author: Prabhas Gupte | Filed under: Knowledge

Web bots? What are they?

Internet bots, also known as web robots, WWW robots or simply bots, are software applications that run automated tasks over the Internet. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone.

In general, web bots can do almost everything a human can do over the Internet. They can:

  • browse the web
  • read web pages
  • fill in and submit forms
  • read emails and respond automatically
  • notice changes in content
  • update their knowledge about something (such as prices)
  • download anything that satisfies particular criteria
  • … and many other actions!

The largest use of web bots is in web spidering, in which an automated script fetches, analyzes and files information from web servers at many times the speed of a human. These programs are also called web crawlers.

Where can I use them?

The web has become a way of life and now carries virtually all of the world's important data. Utilities that bring specific target data to your computer are a vital tool for business today. Web bots address this very need. Web scraping, web aggregation, web crawling and web automation are useful in many cases, and all of them can be carried out by bots with minimal human supervision.

Web Scraping
Sometimes we need data that is publicly available on some web site, for example the prices of certain products. Copy-pasting it every time just to update our own copy is not a good idea. A web scraping bot can do this task for us. Just sit back and relax, and the scraper will do everything on your behalf.
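To make the idea concrete, here is a minimal sketch of a price scraper using only Python's standard library. The markup in SAMPLE_PAGE and the class name "price" are assumptions made for illustration; a real bot would first download the page and would need a parser matched to that site's actual HTML.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The class name "price" is an assumption about the target page.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Any closing tag ends the current price element in this simple sketch.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in the given HTML text."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical markup standing in for a downloaded product page.
SAMPLE_PAGE = """
<ul>
  <li>Widget <span class="price">$9.99</span></li>
  <li>Gadget <span class="price">$24.50</span></li>
</ul>
"""

print(extract_prices(SAMPLE_PAGE))  # ['$9.99', '$24.50']
```

Run on a schedule, a function like extract_prices replaces the manual copy-paste step.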

Web Aggregation
Decision making means comparing one thing with another. So how do we do this: by visiting all the web pages at their different locations and noting down the values? Why not aggregate them instead? Web aggregation bots can fetch the content you care about from different web pages and give you a consolidated view of it. The content from each source looks exactly as it does on the original site. Moreover, the content is fetched live, at run time, so whenever a source is updated, your view is updated too!

Web Crawlers
Want a repository of a particular type of resource from the web? Use a web crawler! Web crawlers go and collect whatever you want from the web. The criteria can be anything: get me all mp3 files, or make a list of all pages that contain the phrase “social media marketing”, … literally anything! These bots usually rely on regular expressions to match what they are looking for. They can give you the list of locations where the resources you want live, and they can even download them for you!
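As a rough sketch of the regular-expression approach, the two helpers below pick mp3 links out of a page and check whether a page mentions a phrase. The function names, the sample HTML and the example.com URLs are all hypothetical, and a regex is only an approximate way to read HTML; the download step is left out.

```python
import re
from urllib.parse import urljoin

# Matches the value of href="..." or href='...' attributes.
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_links(html, base_url, suffix=".mp3"):
    """Return absolute URLs linked from html that end with suffix."""
    found = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative links
        if url.lower().endswith(suffix) and url not in found:
            found.append(url)
    return found


def mentions_phrase(html, phrase):
    """Return True if the visible text of html contains phrase (case-insensitive)."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    text = " ".join(text.split())         # collapse runs of whitespace
    return re.search(re.escape(phrase), text, re.IGNORECASE) is not None


# Hypothetical page content.
SAMPLE = (
    '<a href="/music/a.mp3">A</a> '
    '<a href="about.html">About</a> '
    "<a href='b.MP3'>B</a>"
)

print(find_links(SAMPLE, "http://example.com/files/"))
print(mentions_phrase("<p>Social <b>media</b> marketing tips</p>",
                      "social media marketing"))
```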

Web Watcher
One cannot avoid competition when it comes to business. To stay a step ahead of your competitors, you must keep a watch on them. Every business proudly publishes its Press Releases, Media Coverage, Product Introductions, White Papers, Customer Wins, Upcoming Events, Promotions, Price Changes, Job Openings etc. This information can tell you a lot about what is going on at your competitor’s end. But do we need to invest our own time in watching competitors? No! Web bots can do this too. These bots periodically check whether anything has changed on your competitors’ web pages. If something has, they immediately let you know with an email alert!
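One simple way such a watcher could detect a change is to keep a hash of each page and compare it on every visit. The sketch below assumes the page contents have already been fetched into a dictionary; the function names and URL are hypothetical, and sending the email alert is left out.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 digest of the page text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(previous, current_pages):
    """Compare freshly fetched pages with stored fingerprints.

    previous:      dict mapping URL -> fingerprint from the last run
    current_pages: dict mapping URL -> page text fetched just now
    Returns (changed_urls, updated_fingerprints). A URL seen for the
    first time is reported as changed.
    """
    changed = []
    updated = dict(previous)
    for url, content in current_pages.items():
        digest = fingerprint(content)
        if previous.get(url) != digest:
            changed.append(url)
        updated[url] = digest
    return changed, updated


# Hypothetical competitor page, checked on two consecutive runs.
state = {}
changed, state = detect_changes(state, {"http://example.com/news": "Version 1"})
changed, state = detect_changes(state, {"http://example.com/news": "Version 2"})
print(changed)  # ['http://example.com/news']
```

Hashing the whole page also flags trivial changes such as a rotating banner, so a practical watcher would usually hash only the part of the page it cares about.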

Automation based on Emails

This is a really interesting branch of web bots. You can set up email-automation bots to read emails on your behalf. They can forward emails to others based on the subject line, the sender account or the email contents. Moreover, these bots can even reply to the sender with a predefined message if they find a particular word, phrase or piece of information in the mail body or subject line!
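The decision logic of such a bot can be sketched as a small rule table. Everything named here (the rules, the addresses, the action labels) is made up for illustration, and the sketch only decides what to do with a plain-text message; actually fetching mail and sending the forward or reply is not shown.

```python
from email import message_from_string

# Hypothetical rules: (field to inspect, text to look for, action to take).
RULES = [
    ("subject", "invoice", ("forward", "accounts@example.com")),
    ("from", "@example.org", ("reply", "Thanks, we have received your message.")),
    ("body", "unsubscribe", ("ignore", None)),
]


def decide_action(raw_email, rules=RULES):
    """Return the action of the first rule that matches the message."""
    msg = message_from_string(raw_email)
    # Only single-part text bodies are inspected in this sketch.
    body = "" if msg.is_multipart() else msg.get_payload()
    fields = {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "body": body,
    }
    for field, needle, action in rules:
        if needle.lower() in fields[field].lower():
            return action
    return ("none", None)


SAMPLE_EMAIL = (
    "From: alice@example.org\n"
    "Subject: Invoice for March\n"
    "\n"
    "Please find the invoice attached.\n"
)

print(decide_action(SAMPLE_EMAIL))  # ('forward', 'accounts@example.com')
```

Rules are checked in order, so the subject rule wins here even though the sender rule would also match.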

Link Verifiers

These bots verify that every link on a web page points to a valid, existing and reachable web resource. They generally start with the home page and follow each link present on each web page (avoiding duplicates, of course!). Whenever they find that a linked resource is missing or out of reach (for example, a web page cannot be loaded, or a video file is not readable), they notify the user in one way or another.
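The walk described above can be sketched as a breadth-first traversal with a set of already-seen URLs. To keep the example runnable without a network, the fetch function is passed in and a small dictionary plays the role of the web site; the names and URLs are hypothetical. A real verifier would fetch over HTTP and would normally stay within one site.

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_broken_links(start_url, fetch):
    """Walk links breadth-first from start_url and return unreachable URLs.

    fetch(url) must return the page's HTML text, or None if the
    resource is missing or out of reach.
    """
    seen = {start_url}           # avoids visiting duplicates
    queue = deque([start_url])
    broken = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            broken.append(url)
            continue
        for href in HREF_RE.findall(html):
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return broken


# Hypothetical three-page site with one dead link.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">Home</a> <a href="/missing.html">Gone</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

print(find_broken_links("http://example.com/", SITE.get))
# ['http://example.com/missing.html']
```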

These are some of the major uses of web bots, but the list does not end here! Name an activity on the Internet, and most of the time it is doable with web bots. Yes, most of the time… There are still some cases where web bots fail to perform an action, especially when working with dynamically generated links: the ones that carry a pseudo session ID in the URL. The session ID never remains the same, and bots fail in this case.

There are many ethical rules and other precautions that every web bot should follow (or is at least expected to). Good web bot developers always follow them, the most important being to respect the robots.txt file on each web server. If this file forbids you from crawling or visiting a directory, then you should not! (I will write on this topic in a separate post.)
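For a quick idea of what respecting robots.txt looks like in practice, Python's standard library ships a parser for it. The sketch below feeds it a made-up robots.txt from a string so that it runs offline; the bot name "examplebot" and the URLs are assumptions. A real bot would load the file from the target server before crawling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_checker(robots_txt, user_agent):
    """Return a function that tells whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = build_checker(SAMPLE_ROBOTS, "examplebot")

print(allowed("http://example.com/public/index.html"))  # True
print(allowed("http://example.com/private/data.html"))  # False
```

A polite crawler calls a check like this before every request and skips any URL for which it returns False.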

But overall, web bots are becoming really important factors in the world of the web!


pubsubhubbub

Posted: February 27th, 2010 | Author: Prabhas Gupte | Filed under: Knowledge

PubSubHubbub is a simple, open, server-to-server, webhook-based pubsub (publish/subscribe) protocol, designed as an extension to Atom and RSS. Servers that speak the PubSubHubbub (PSHB) protocol can get near-instant notifications (via webhook callbacks) when a topic (feed URL) they’re interested in is updated.

In a nutshell, the protocol is something like this:
1. A feed declares its hub server(s) in its Atom/RSS XML file with a <link rel="hub" …> element. The hub(s) can be run by the publisher of the feed itself, or can be a community hub that anybody can use.
2. A subscriber (i.e. a server that is interested in the feed) initially fetches the Atom URL as usual. If the Atom file declares its hub(s), the subscriber can avoid repeated polling. Instead, it can register with the feed’s hub(s) to get notified about updates.
3. The subscriber now subscribes to the target feed URL at the hub.
4. Whenever the publisher updates the feed, it pings the hub(s) to tell them about the update.
5. The hub now fetches the published updates and multicasts the new or changed content to all the subscribers registered for this feed!
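The subscriber's side of the first three steps can be sketched as follows: read the hub address out of an Atom feed, then build the form-encoded body of a subscription request. The feed, hub and callback URLs are hypothetical. The hub.mode, hub.topic and hub.callback fields are the core of the request, but some versions of the specification require further fields (such as hub.verify), so treat this as an outline rather than a complete client. Sending the POST and answering the hub's verification request are not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def discover_hubs(atom_xml):
    """Return the href of every top-level <link rel="hub"> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [
        link.get("href")
        for link in root.findall(ATOM_NS + "link")
        if link.get("rel") == "hub"
    ]


def subscription_body(topic_url, callback_url):
    """Form-encode the core fields a subscriber POSTs to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed we want updates for
        "hub.callback": callback_url,  # where the hub should notify us
    })


# Hypothetical feed that declares one hub.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="hub" href="http://hub.example.com/"/>
  <link rel="self" href="http://blog.example.com/feed.atom"/>
</feed>"""

print(discover_hubs(SAMPLE_FEED))  # ['http://hub.example.com/']
print(subscription_body("http://blog.example.com/feed.atom",
                        "http://subscriber.example.com/push"))
```

If discover_hubs returns an empty list, the feed declares no hub and the subscriber has to fall back to ordinary polling.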

You can watch the following video and/or slide show to get a fair idea of the PSHB protocol.

This project is hosted on Google Code. The source code, wiki and some other documents are also available there.