Screen Scraping: Simple HTML DOM Library

How to use the open source PHP Simple HTML DOM Parser library to provide jQuery-grade awesomeness for easy screen scraping without messy regular expressions

Client-side developers have always had it easy: libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get rather messy, DOM calls can be confusing and verbose, and often the string functions just aren’t enough.

The Simple HTML DOM Parser is implemented as a simple PHP class and a few helper functions. It supports CSS-selector-style screen scraping (as in jQuery), can handle invalid HTML, and even provides a familiar interface for manipulating a DOM.

Here’s a sample of Simple HTML DOM in action:

$html = file_get_dom('http://www.google.com/');
foreach ($html->find('a') as $element)
    echo $element->href;

This snippet is fairly self-explanatory: file_get_dom() is a simple helper function in the library that fetches the page and constructs a new Simple HTML DOM object around it. Once the object is available, we can use simple CSS selectors to find our elements (in this case, anchors) and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is roughly twice as long.)
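For comparison, here is a rough sketch of the same link extraction using only PHP's built-in DOM classes. The inline $markup string is a stand-in for a fetched page and is purely an assumption for illustration; nothing here comes from the library itself.

```php
<?php
// Sample markup standing in for a downloaded page (hypothetical content).
$markup = '<html><body><a href="/one">One</a><p><a href="/two">Two</a></p></body></html>';

// Collect libxml warnings about sloppy HTML instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

// Walk every <a> element and keep the href attribute when present.
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $links[] = $anchor->getAttribute('href');
    }
}

print_r($links);
```

Even in this small case the built-in route needs explicit error handling and attribute checks, which is the verbosity the selector-based approach is meant to hide.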

But the library doesn’t stop there: as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

The library supports many DOM-style approaches for manipulation, from exposing real attributes as shown here to a few helper methods. It also includes methods for traversing from the current node: children(), parent(), first_child() and so on.
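To make the traversal methods concrete, here is a minimal sketch. It assumes the library file is saved as simple_html_dom.php somewhere on the include path, and the menu markup is made up for the example.

```php
<?php
// Assumption: the library file is available under this name on the include path.
require_once 'simple_html_dom.php';

// Hypothetical markup, used only to have something to walk.
$html = str_get_html('<ul id="menu"><li>Home</li><li>About</li></ul>');

$list  = $html->find('ul#menu', 0);   // first element matching the selector
$first = $list->first_child();        // first <li> under the list
$last  = $list->last_child();         // last <li> under the list

$firstText  = $first->innertext;                  // contents of the first item
$lastText   = $last->innertext;                   // contents of the last item
$nextText   = $first->next_sibling()->innertext;  // sibling after the first item
$parentId   = $first->parent()->id;               // back up to the <ul> and read its id
$childCount = count($list->children());           // children() with no index returns an array

echo "$firstText, $lastText, $nextText, $parentId, $childCount\n";
```

Each call returns another node object, so the traversal methods can be chained with find() and attribute access in whatever order suits the page being scraped.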

Real scraping? Easy. Here’s their Slashdot sample:

$html = file_get_html('http://slashdot.org/');
foreach ($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
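One caveat with samples like this: find() with an index returns null when nothing matches, so reading ->plaintext directly fails as soon as the page layout changes. The sketch below guards against that. The first_text() helper and the inline markup are hypothetical additions for illustration, not part of the library, and the library file is again assumed to be simple_html_dom.php on the include path.

```php
<?php
// Assumption: the library file is available under this name on the include path.
require_once 'simple_html_dom.php';

// Hypothetical helper: plain text of the first match, or null if nothing matches.
function first_text($node, $selector) {
    $match = $node->find($selector, 0);
    return $match !== null ? trim($match->plaintext) : null;
}

// Made-up markup; the second article deliberately lacks an intro block.
$html = str_get_html(
    '<div class="article"><div class="title">First</div><div class="intro">Intro text</div></div>' .
    '<div class="article"><div class="title">Second</div></div>'
);

$articles = array();
foreach ($html->find('div.article') as $article) {
    $articles[] = array(
        'title' => first_text($article, 'div.title'),
        'intro' => first_text($article, 'div.intro'),
    );
}

print_r($articles);
```

Building each item as a fresh array also avoids carrying a stale value over from the previous iteration when a field is missing.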

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, the online documentation, or the project page on SourceForge.