Screen scraping with PHP 5

When it comes to web scraping, PHP 5 is a good choice. PHP 5 has real, standardized XML capabilities, something native PHP 4 lacked. You can load a document into an XML DOM object and select elements from it using XPath.
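As a minimal sketch of that DOM plus XPath workflow, the snippet below loads a small XML string and selects nodes with an XPath expression. The sample data and variable names are made up for illustration; only DOMDocument and DOMXPath come from PHP itself.

```php
<?php
// Hypothetical sample data; any well-formed XML string would do.
$xml = <<<XML
<books>
  <book year="2004"><title>First</title></book>
  <book year="2006"><title>Second</title></book>
</books>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml); // strict XML parsing

$xpath = new DOMXPath($dom);

// Collect the titles of all books whose year attribute is greater than 2005.
$titles = array();
foreach ($xpath->query('//book[@year > 2005]/title') as $node) {
    $titles[] = $node->textContent;
}

print_r($titles);
```

The XPath predicate does the filtering, so the loop only has to read the text of the nodes that matched.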

Loading any HTML file

Not only can you make use of well-formed XML files, you can also parse any HTML tag soup. As you may know, most pages on the Web are not valid HTML, let alone XML/XHTML. Parsing them with a strict XML parser would yield nothing but error messages.
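To make the difference concrete, here is a small sketch that feeds the same malformed markup to the strict XML loader and to the lenient HTML loader. The markup string is a hypothetical example, and the "@" operator is used here only to keep parser warnings off the screen.

```php
<?php
// Hypothetical tag soup: unclosed <p> and <br>, unquoted attribute values.
$soup = '<p>Unclosed paragraph<br><a href=/one>One</a><a href=/two>Two</a>';

// Strict XML parser: gives up on malformed input and returns false.
$strict = new DOMDocument();
$okXml = @$strict->loadXML($soup);

// HTML parser: recovers from the errors and still builds a DOM tree.
$lenient = new DOMDocument();
$okHtml = @$lenient->loadHTML($soup);
$links = $lenient->getElementsByTagName('a')->length;

var_dump($okXml);  // bool(false)
var_dump($okHtml); // bool(true)
echo "Links found by the HTML parser: $links\n";
```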

The simple but powerful method we need is DOMDocument's “loadHTMLFile”, which accepts a local path or a URL as its parameter (loading a URL requires that PHP is allowed to open remote files).
Here’s a sample of its usage, which will grab all the hyperlinks on this site and format them as a simple HTML list of links:

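<?php
$dom = new DOMDocument();
$url = 'http://prabhasgupte.com';
// "@" silences the warnings libxml raises for malformed HTML.
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$xNodes = $xpath->query('//a'); // every <a> element in the document
echo '<h1>Links on PrabhasGupte.com</h1>';
echo '<ul>';
foreach ($xNodes as $xNode) {
    // Text of the first child node; "@" hides the notice when there is none.
    $sLinktext = @$xNode->firstChild->data;
    $sLinkurl = $xNode->getAttribute('href');
    if ($sLinktext != '' && $sLinkurl != '') {
        // Escape both values before writing them back out as HTML.
        echo '<li><a href="' . htmlspecialchars($sLinkurl) . '">' . htmlspecialchars($sLinktext) . '</a></li>';
    }
}
echo '</ul>';
?>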

The two relevant lines are the ones preceded by an “@” symbol. The “@” suppresses error messages, which is necessary, at least in this version of PHP 5, when you use loadHTMLFile on non-XML files. So basically the script above says “Get the PrabhasGupte.com homepage, grab every link, then go through all the links and output them again”.
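If you would rather not hide every message with “@”, one possible alternative is libxml_use_internal_errors(), which buffers the parser's complaints so you can inspect or discard them. The sketch below is not part of the original script; it parses a hypothetical inline string with loadHTML so it runs without network access, and the variable names are illustrative.

```php
<?php
// Hypothetical inline markup, used instead of a live URL.
$html = '<p>Unclosed paragraph<a href="/one">One</a><bogus>text</bogus>';

// Buffer parser errors instead of emitting PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Turn whatever the parser reported into readable strings.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even if the parser complained.
$xpath = new DOMXPath($dom);
$linkCount = $xpath->query('//a[@href]')->length;

echo count($messages) . " parser message(s), $linkCount link(s) found\n";
foreach ($messages as $message) {
    echo $message . "\n";
}
```

Which messages the parser reports depends on the libxml version PHP was built against, so the sketch prints them rather than relying on a particular set.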

Instead of the PrabhasGupte.com homepage, you can extract data from any page that is accessible online.

Visitors’ Tracker

This is a web-based service developed to track daily visits to your website. You just need to include a small JavaScript snippet in your website’s home page, and the script takes care of everything else.

It collects many statistics about site visits, including IP addresses, page hits, site hits per IP address, and hits per hour. At the end of the day, the service sends a consolidated report to your email address. You can then use this report to judge how appealing your site is to users.