PHP Document Object Model (DOM) Parser

Document Object Model

The PHP DOM (Document Object Model) extension parses XML and HTML into a tree of nodes that you can traverse and modify. The first step is to construct a DOMDocument object and then load the HTML content into it.

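// a new DOM object
$dom = new DOMDocument();

// discard white space (set this before loading; changing it afterwards
// does not affect a document that has already been parsed)
$dom->preserveWhiteSpace = false;

// load the HTML into the object
$dom->loadHTML($html);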
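
Real-world pages are often not valid HTML, and loadHTML reports parser problems as PHP warnings by default. The following is a minimal sketch, assuming a hypothetical sample string in $html, of collecting those messages through libxml's internal error buffer instead of letting them reach the output:

```php
<?php
// Hypothetical sample markup; any HTML string could stand in here.
$html = '<html><head><title>The Title</title></head>'
      . '<body><p>Hello <b>world</b></body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser reported, then restore the previous setting.
$problems = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded) {
    echo $dom->getElementsByTagName('title')->item(0)->nodeValue, "\n";
}
echo count($problems), " parser message(s)\n";
```

Whether a given page produces any messages depends on the markup and on the libxml version, so the count is best treated as diagnostic information rather than as a pass/fail signal.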

Concept of DOM

Everything in a DOM document is a node. A DOMDocument is a hierarchical tree of nodes that starts with a root node. The root node can have child nodes, and those child nodes can have child nodes of their own. For example, an HTML page has a root element (HTML) with two children (HEAD and BODY).

<title>The Title</title>

This markup has two nodes: a DOMElement with a DOMText child.

An element that carries a single attribute has three nodes: the DOMElement with a DOMAttr holding a DOMText.
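
The node types described above can be checked directly. This is a small sketch, assuming hypothetical markup with a title element and a link that has one attribute; it prints the class of each node involved:

```php
<?php
$dom = new DOMDocument();
// Hypothetical markup: a title element and a link with one attribute.
$dom->loadHTML('<html><head><title>The Title</title></head>'
    . '<body><a href="/home"></a></body></html>');

// The title element and its text child.
$title = $dom->getElementsByTagName('title')->item(0);
echo get_class($title), "\n";              // DOMElement
echo get_class($title->firstChild), "\n";  // DOMText

// The link element, its attribute node, and the attribute's text child.
$link = $dom->getElementsByTagName('a')->item(0);
$href = $link->getAttributeNode('href');
echo get_class($link), "\n";               // DOMElement
echo get_class($href), "\n";               // DOMAttr
echo get_class($href->firstChild), "\n";   // DOMText
```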

There are two important functions that can be used to extract contents from the html structure:

getElementsByTagName
getElementById

1. Get Elements by Tag Name

The function getElementsByTagName returns a DOMNodeList containing all the elements with a given tag name. A DOMNodeList is not a PHP array, but it can be iterated with foreach. This function is useful when you want to read the content or attributes of multiple HTML elements that have the same tag.

Example: For Getting Tables

$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
 echo $dom->saveHTML($table);
}

The saveHTML function returns the HTML of the node passed to it, including the node's own opening and closing tags. To get the total number of elements, you can use the length property of the node list.

echo 'Found: ' . $tables->length . ' items';
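
Once a table has been located, its rows and cells can be walked in the same way. The following is a minimal sketch, assuming a small hypothetical table in $html, that collects the cell text of each row into a nested PHP array:

```php
<?php
// Hypothetical sample table.
$html = '<table><tr><th>Name</th><th>Qty</th></tr>'
      . '<tr><td>Apple</td><td>3</td></tr>'
      . '<tr><td>Pear</td><td>5</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$rows = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->childNodes as $cell) {
        // Skip text nodes between cells; keep only th/td elements.
        if ($cell instanceof DOMElement) {
            $cells[] = trim($cell->textContent);
        }
    }
    $rows[] = $cells;
}

print_r($rows);
```

Note that every value comes back as a string, so numeric columns need an explicit cast if they are to be used in calculations.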

Example: For Getting Links

$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $node)
{
 echo $dom->saveHTML($node);
}

Inside the loop, each $node gives access to four things: the tag name, the attribute names, the attribute values, and the enclosed tag content.

1. To get the text values of the node (enclosed tag content):

echo $node->nodeValue;

2. To check if the href attribute exists:

// hasAttribute returns a boolean; echo would print nothing for false
var_dump($node->hasAttribute('href'));

3. To get the href attribute value:

echo $node->getAttribute('href');

4. To change the href attribute value:

$node->setAttribute('href', 'something else');

5. To remove the href attribute and its value:

$node->removeAttribute('href');
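
The attribute calls above can be combined into one pass over a page's links. This is a sketch under stated assumptions: the markup in $html is hypothetical, and anchors without an href are simply skipped.

```php
<?php
// Hypothetical sample markup: two real links and one named anchor.
$html = '<p><a href="/about">About us</a> <a name="top">Top</a> '
      . '<a href="https://example.com/">Example</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    // Anchors without an href are not links; skip them.
    if (!$node->hasAttribute('href')) {
        continue;
    }
    $links[] = [
        'text' => trim($node->nodeValue),
        'href' => $node->getAttribute('href'),
    ];
}

print_r($links);
```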

2. Get Element by Id

The function getElementById returns the DOMElement with the given id, or NULL if no such element is found. This function is useful when you want to read the content or an attribute value of an HTML element with a specified id.

$element = $dom->getElementById('myid');

// guard against a missing element before reading from it
if ($element !== null)
{
 echo $element->nodeValue;
}

3. DOMXPath in PHP

The DOMXPath class is part of the PHP DOM extension. XPath uses path expressions to select nodes.

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

Syntax for XPath Query

/    Selects from the root node
//   Selects nodes in the document from the current node that match the selection no matter where they are
.    Selects the current node
..   Selects the parent of the current node
@    Selects attributes
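
Each of these expressions can be tried against a small document. The following is a minimal sketch, assuming hypothetical markup in $html; the relative expressions "." and ".." are evaluated against a context node passed as the second argument to query:

```php
<?php
// Hypothetical sample page.
$html = '<html><head><title>Demo</title></head><body>'
      . '<div id="nav"><a href="/a">A</a><a href="/b">B</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "/" walks down from the root node.
$title = $xpath->query('/html/head/title')->item(0)->nodeValue;

// "//" matches anywhere in the document.
$linkCount = $xpath->query('//a')->length;

// "@" selects attributes; each result here is a DOMAttr.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}

// "." and ".." are relative to the context node given as second argument.
$firstLink = $xpath->query('//a')->item(0);
$parentId  = $xpath->query('..', $firstLink)->item(0)->getAttribute('id');
$selfText  = $xpath->query('.', $firstLink)->item(0)->nodeValue;

echo "$title | $linkCount | " . implode(',', $hrefs) . " | $parentId | $selfText\n";
// prints: Demo | 2 | /a,/b | nav | A
```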

Parse h1 tag text

$heading1 = '';
$contents = $xpath->query('//h1');
// query() returns false (not null) when the expression is invalid
if ($contents !== false)
{
 foreach ($contents as $node) {
  $heading1 .= ' ' . $node->nodeValue;
 }
}
echo("h1: $heading1\n\n");

Parse h3 and h4 tag text

$heading3and4 = '';
$contents = $xpath->query('//h3 | //h4');
// query() returns false (not null) when the expression is invalid
if ($contents !== false)
{
 foreach ($contents as $node) {
  $heading3and4 .= ' ' . $node->nodeValue;
 }
}
echo("h3 and h4s: $heading3and4\n\n");

Parse meta description

$metaDescription = '';
$contents = $xpath->query('/html/head/meta[@name="description"]/@content');
if ($contents->length != 0) 
{
 foreach ($contents as $content) {
 $metaDescription .= $content->value;
 }
}
echo("Meta Description: $metaDescription\n\n");

Parse meta keywords

$metaKeywords = '';
$contents = $xpath->query('/html/head/meta[@name="keywords"]/@content');
if ($contents->length != 0)
{
 foreach ($contents as $content) {
  $metaKeywords .= ' ' . $content->value;
 }
}
echo("Meta Keywords: $metaKeywords\n\n");
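
When only a single string is needed, the loop can be avoided. This is a sketch of an alternative, not the approach used above: it assumes a hypothetical page in $html and uses DOMXPath::evaluate with the XPath string() function, which yields the text of the first matching node or an empty string when nothing matches.

```php
<?php
// Hypothetical sample page with two meta tags.
$html = '<html><head>'
      . '<meta name="description" content="A short demo page">'
      . '<meta name="keywords" content="php, dom, xpath">'
      . '</head><body><h1>Hi</h1></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// string() converts the first matching node to text, or '' if none matches.
$description = $xpath->evaluate('string(/html/head/meta[@name="description"]/@content)');
$keywords    = $xpath->evaluate('string(/html/head/meta[@name="keywords"]/@content)');
$author      = $xpath->evaluate('string(/html/head/meta[@name="author"]/@content)');

echo "Meta Description: $description\n";
echo "Meta Keywords: $keywords\n";
echo 'Author present: ', ($author === '' ? 'no' : 'yes'), "\n";
```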

Parse elements with class name

$nodeList = $xpath->query("//div[@class='class_name']");
$node = $nodeList->item(0);

// To check the result (item(0) is null when nothing matched):
if ($node !== null)
{
 echo "<p>" . $node->nodeValue . "</p>";
}
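
The expression [@class='class_name'] compares the whole attribute value, so it misses elements that carry several classes. The following is a minimal sketch of a common workaround, assuming hypothetical markup and a hypothetical class name "card": the attribute is padded with spaces so that the class is matched as a whole word.

```php
<?php
// Hypothetical sample markup; "discard" must not match "card".
$html = '<div class="card featured">One</div>'
      . '<div class="card">Two</div>'
      . '<div class="discard">Three</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact match: only elements whose class attribute is exactly "card".
$exact = $xpath->query("//div[@class='card']");

// Token match: pad with spaces so "card" is matched as a whole word.
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' card ')]"
);

echo $exact->length, " exact, ", $token->length, " token\n";
// prints: 1 exact, 2 token
```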