<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ivan Villareal &#187; Parsing HTML documents with Simple Html DOM &#8211; Ivan Villareal</title>
	<atom:link href="http://ivanvillareal.com/tag/scrapping/feed/" rel="self" type="application/rss+xml" />
	<link>http://ivanvillareal.com</link>
	<description>IT stuff and more...</description>
	<lastBuildDate>Tue, 01 Nov 2011 23:00:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Parsing HTML documents with Simple Html DOM</title>
		<link>http://ivanvillareal.com/development/parsing-html-documents-whit-simple-html-dom/</link>
		<comments>http://ivanvillareal.com/development/parsing-html-documents-whit-simple-html-dom/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 23:02:34 +0000</pubDate>
		<dc:creator>Ivan Villareal</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[DOM]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[Scrapping]]></category>

		<guid isPermaLink="false">http://ivanvillareal.com/?p=45</guid>
		<description><![CDATA[I had to get some information from an XHTML document, the information required was on nested tables, I&#8217;ve started working on that and my first approach was to get the [...]]]></description>
			<content:encoded><![CDATA[<p>I had to get some information from an XHTML document, the information required was on nested tables, I&#8217;ve started working on that and my first approach was to get the needed info using regular expressions.</p>
<p>After 2 hours I&#8217;ve got a set of regex that worked pretty well, but when I received a new document, the layout was changed a little bit, and my didn&#8217;t worked as expected, the content was on the same table but this time it had nested tables, which were a really big problem.</p>
<p>I already have used the PHP DOM, but only on xml files, I didn&#8217;t was aware that it had the ability to parse HTML, so I&#8217;ve started working with this and within an hour I had it working, and this time the changes in the document didn&#8217;t affect the scrapping.</p>
<p>In about an hour I had a larger class with several methods to get all the elements I needed, but suddenly I was presented with another challenge, again with nested tables, sometimes the number of childs were shorter than expected, I&#8217;ve experimented several things until I found this <a href="http://sourceforge.net/projects/simplehtmldom/">Simple Html Dom</a>&nbsp;it is pretty straight forward, and it does an excellent job scrapping html documents, all the methods I did were replaced by this:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">            <span style="color: #000088;">$html</span>   <span style="color: #339933;">=</span> <span style="color: #990000;">curl_exec</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$ch</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000088;">$dom</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> simple_html_dom<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000088;">$dom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">load</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$html</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000088;">$items</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000088;">$tabla</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$dom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'table[cellpadding^=2]'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
            <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$dom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'table[cellpadding^=2]'</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$table</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$table</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'tr'</span><span style="color: #009900;">&#41;</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$tr</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                    <span style="color: #000088;">$link</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$tr</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">find</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'a'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">title</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$link</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                        <span style="color: #000088;">$item</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'item'</span><span style="color: #009900;">&#93;</span>  <span style="color: #339933;">=</span> <span style="color: #000088;">$link</span><span style="color: #339933;">;</span>
                        <span style="color: #000088;">$item</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'price'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$tr</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">children</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">plaintext</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                        <span style="color: #000088;">$item</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'bids'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$tr</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">children</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">plaintext</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                        <span style="color: #000088;">$items</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$item</span><span style="color: #339933;">;</span>
                    <span style="color: #009900;">&#125;</span>
                <span style="color: #009900;">&#125;</span>
            <span style="color: #009900;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://ivanvillareal.com/development/parsing-html-documents-whit-simple-html-dom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

