Screen scraping with PHP 5
When it comes to web scraping, PHP 5 is a strong choice. Unlike PHP 4, PHP 5 has real, standardized XML capabilities built in: you can load a document into a DOM object and select elements using XPath.
Loading any HTML file
Not only can you make use of well-formed XML files, you can also grab any HTML tag soup. As you may know, most pages on the Web are not valid HTML, let alone XML/XHTML. Parsing them with a strict XML parser would yield nothing but error messages.
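To illustrate the difference, here is a minimal sketch that feeds the same malformed fragment to the strict XML loader and to the lenient HTML loader. It uses DOMDocument's loadXML and loadHTML methods, which parse from a string, so no network access is needed. The fragment in $soup is a made-up example, not taken from any real page.

```php
<?php
// Hypothetical tag-soup fragment: unquoted attribute, unclosed <li> and <p>.
$soup = '<ul><li><a href=/one>One</a><li><a href="/two">Two</a></ul><p>Unclosed paragraph';

// Strict XML parsing: loadXML() returns false because the markup is not well-formed.
$xml = new DOMDocument;
$xmlOk = @$xml->loadXML($soup);

// Lenient HTML parsing: loadHTML() repairs the markup and builds a DOM anyway.
$html = new DOMDocument;
$htmlOk = @$html->loadHTML($soup);

// Count the links the HTML parser recovered from the fragment.
$linkCount = $html->getElementsByTagName('a')->length;

echo 'XML parse succeeded: ' . var_export($xmlOk, true) . "\n";
echo 'HTML parse succeeded: ' . var_export($htmlOk, true) . "\n";
echo 'Links found: ' . $linkCount . "\n";
```

The strict loader gives up on the first well-formedness error, while the HTML loader still yields a document that can be queried.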
The simple but powerful method we need is DOMDocument's “loadHTMLFile”, which accepts a local file path or a URL as its parameter.
Here’s a sample of its usage, which grabs all the hyperlinks on this site’s homepage and formats them as a simple HTML list of links:
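<?php
// Load the page with the lenient HTML parser; "@" hides parser warnings.
$dom = new DOMDocument;
$url = 'http://prabhasgupte.com';
@$dom->loadHTMLFile($url);

// Select every <a> element in the document.
$xpath = new DOMXPath($dom);
$xNodes = $xpath->query('//a');

echo '<h1>Links on PrabhasGupte.com</h1>';
echo '<ul>';
foreach ($xNodes as $xNode) {
    // "@" hides the notice raised when a link's first child is not a text node.
    $sLinktext = @$xNode->firstChild->data;
    $sLinkurl = $xNode->getAttribute('href');
    if ($sLinktext != '' && $sLinkurl != '') {
        // Escape both values so they cannot break the generated markup.
        echo '<li><a href="' . htmlspecialchars($sLinkurl) . '">' . htmlspecialchars($sLinktext) . '</a></li>';
    }
}
echo '</ul>';
?>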
The two relevant lines are the ones preceded by an “@” symbol, which suppresses error messages. This is necessary, at least in this version of PHP 5, when you use loadHTMLFile on non-XML files. So basically the script above says: “Get the PrabhasGupte.com homepage, grab every link, then go through all the links and output them again.”
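As an alternative to the “@” operator, libxml can be told to collect its parser errors internally instead of emitting PHP warnings. The sketch below assumes a made-up fragment in $markup standing in for a downloaded page. Which errors the parser reports, if any, depends on the installed libxml version.

```php
<?php
// Hypothetical malformed markup standing in for a downloaded page.
$markup = '<div><a href="/a">First</a><bogus>text</div>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadHTML($markup);

// Fetch whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with line and message properties.
    echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
}

echo 'Links found: ' . $dom->getElementsByTagName('a')->length . "\n";
```

Unlike “@”, this approach keeps the parser messages available for logging while still keeping them out of the page output.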
Instead of the PrabhasGupte.com homepage, you can extract data from any page that is accessible online.
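One way to reuse the idea for other pages is to wrap the extraction in a function. The following is only a sketch: the extractLinks name, its return format and the $sample string are assumptions made for illustration. It works on an HTML string via loadHTML so that it runs without network access; for a live page, loadHTMLFile could be used in the same place. It reads textContent rather than firstChild->data, so links whose text is wrapped in another element (such as a bold tag) are also picked up.

```php
<?php
// Hypothetical helper: returns a list of array('text' => ..., 'url' => ...)
// entries for every link with an href attribute and non-empty text.
function extractLinks($html)
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // "@" hides warnings about malformed markup

    $xpath = new DOMXPath($dom);
    $links = array();

    // '//a[@href]' selects only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $node) {
        // textContent concatenates all text inside the link, nested tags included.
        $text = trim($node->textContent);
        if ($text !== '') {
            $links[] = array('text' => $text, 'url' => $node->getAttribute('href'));
        }
    }
    return $links;
}

// Made-up sample: one nested link, one anchor without href, one plain link.
$sample = '<p><a href="/home"><b>Home</b></a> <a name="top">anchor only</a> <a href="/about">About</a>';
$result = extractLinks($sample);
print_r($result);
```

For the sample above, the function returns two entries, “Home” and “About”, and skips the anchor that has no href.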