Wednesday, Sep 30. 2009

Parsing HTML documents with Simple Html DOM

I had to extract some information from an XHTML document. The data I needed was inside nested tables, and my first approach was to pull it out with regular expressions.

After two hours I had a set of regexes that worked pretty well. But when I received a new document, the layout had changed a little and my expressions didn't work as expected. The content was still in the same table, but this time that table contained nested tables, which turned out to be a really big problem.
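To illustrate why nested tables are so painful for regular expressions, here is a minimal sketch. The sample markup and the pattern are made up for this example (they are not the ones from my scraper), but they show the typical failure: a non-greedy match ends at the first closing tag it sees, which belongs to the inner table.

```php
<?php
// Hypothetical sample markup: an outer table that contains a nested table.
$html = '<table id="outer"><tr><td>'
      . '<table id="inner"><tr><td>42</td></tr></table>'
      . '</td><td>after</td></tr></table>';

// A non-greedy pattern stops at the FIRST closing tag, which closes the inner table.
preg_match('#<table[^>]*>(.*?)</table>#s', $html, $match);

// The capture is cut off before the "after" cell of the outer table.
$capturedOuterCell = strpos($match[1], 'after') !== false;
var_dump($capturedOuterCell); // bool(false)
```

Making the pattern greedy does not really help either, because it then swallows everything up to the last closing tag in the document.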

I had already used the PHP DOM extension, but only on XML files; I wasn't aware that it could parse HTML too. So I started working with it, and within an hour I had it working. This time the changes in the document didn't affect the scraping.
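For reference, parsing HTML with the PHP DOM extension can look roughly like the sketch below. The markup, the `id` attributes and the XPath expression are assumptions made up for the example, not taken from the document I was scraping. The query accepts rows with or without a `tbody` wrapper, so it does not depend on how the parser builds the tree.

```php
<?php
// Hypothetical sample markup with a nested table inside the first cell.
$html = '<html><body><table id="outer"><tr><td>'
      . '<table id="inner"><tr><td>nested</td></tr></table>'
      . '</td><td>Widget</td><td>9.99</td></tr></table></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select only the cells that belong directly to the outer table's rows,
// so the cells of the nested table are not mixed into the result.
$cells = $xpath->query(
    '//table[@id="outer"]/tr/td | //table[@id="outer"]/tbody/tr/td'
);

$values = array();
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}
print_r($values);
```

Because the query walks the tree instead of matching text, the nested table only shows up as the content of the first cell and the other cells keep their positions.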

In about an hour I had a large class with several methods to get all the elements I needed, but then I ran into another challenge, again with nested tables: sometimes there were fewer child nodes than expected. I experimented with several things until I found Simple Html Dom. It is pretty straightforward and does an excellent job of scraping HTML documents. All the methods I had written were replaced by this:

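            // $ch is a cURL handle set up earlier with CURLOPT_RETURNTRANSFER enabled
            $html  = curl_exec($ch);
            $dom   = new simple_html_dom();
            $dom->load($html);
            $items = array();

            // Walk every table whose cellpadding attribute starts with "2"
            foreach ($dom->find('table[cellpadding^=2]') as $table) {
                foreach ($table->find('tr') as $tr) {
                    $anchor = $tr->find('a', 0);
                    $price  = $tr->children(2);
                    $bids   = $tr->children(3);
                    // Skip rows without a titled link or with too few cells
                    if (!$anchor || !$price || !$bids || trim($anchor->title) === '') {
                        continue;
                    }
                    $items[] = array(
                        'item'  => trim($anchor->title),
                        'price' => trim($price->plaintext),
                        'bids'  => trim($bids->plaintext),
                    );
                }
            }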


