PHP's DOM (Document Object Model) extension handles both XML and HTML well. It parses the markup into a tree of nodes that you can traverse and query. The first step is to construct a DOMDocument object and load the HTML content into it.

// a new DOM object
$dom = new DOMDocument;

// discard insignificant white space (set this before loading, or it has no effect)
$dom->preserveWhiteSpace = false;

// load the HTML into the object
$dom->loadHTML($html);
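Real-world HTML is often malformed, and loadHTML reports each problem as a warning. The following is a minimal sketch of one way to collect those warnings quietly with libxml_use_internal_errors; the sample markup in $html is a made-up placeholder, not taken from any real page.

```php
<?php
// Hypothetical sample markup; real input would come from a file or an HTTP response.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header"><p>Hello</p></div></body></html>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Read whatever the parser complained about, then reset the error state.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$rootName = $dom->documentElement->nodeName;
echo 'Loaded: ' . ($loaded ? 'yes' : 'no') . ', root element: ' . $rootName . "\n";
echo 'Parser messages: ' . count($parseErrors) . "\n";
```

Restoring the previous setting at the end keeps this snippet from changing error handling for the rest of the script.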

Concept of DOM

Everything in a DOM document is a node. A DOMDocument is a hierarchical tree of nodes that starts with a root node. The root node can have child nodes, and those child nodes can have child nodes of their own. For example, an HTML page has a root element (HTML) with two children (HEAD and BODY).

<title>The Title</title>

This markup yields two nodes: a DOMElement with a DOMText child.

<div class="header">

This markup yields three nodes: the DOMElement, with a DOMAttr that holds a DOMText.
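The node types described above can be checked directly. This is a small sketch, assuming a made-up document that contains both the title and the div from the examples; it prints the class of each node involved.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"></div></body></html>'
);

// <title>The Title</title>: an element node with a text node child.
$title = $dom->getElementsByTagName('title')->item(0);
$titleChild = $title->firstChild;

// <div class="header">: an element node, an attribute node, and the
// text node that stores the attribute's value.
$div = $dom->getElementsByTagName('div')->item(0);
$classAttr = $div->getAttributeNode('class');
$attrChild = $classAttr->firstChild;

echo get_class($title), "\n";       // DOMElement
echo get_class($titleChild), "\n";  // DOMText
echo get_class($div), "\n";         // DOMElement
echo get_class($classAttr), "\n";   // DOMAttr
echo get_class($attrChild), "\n";   // DOMText
```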

Two important methods can be used to extract content from the HTML structure:

  1. getElementsByTagName
  2. getElementById

Get Elements by Tag Name

The getElementsByTagName method returns a DOMNodeList containing all the elements with a given tag name. The list is not a PHP array, but it can be iterated with foreach. This method is useful when you want to read the content or attributes of multiple HTML elements that share the same tag.

Example: For Getting Tables

$tables = $dom->getElementsByTagName('table');
foreach ($tables as $table)
{
    echo $dom->saveHTML($table);
}

The saveHTML method returns the HTML for the node passed to it, including the node's own tags. To get the total number of elements, use the length property.

echo 'Found: ' . $tables->length . ' items';
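Putting the pieces together, here is a short self-contained sketch, using invented sample markup, that counts the tables in a document and collects the text of every cell. It also calls getElementsByTagName on an element rather than on the document, which limits the search to that element's descendants.

```php
<?php
// Hypothetical sample markup with two tables.
$html = '<html><body>'
      . '<table><tr><td>A</td></tr></table>'
      . '<p>not a table</p>'
      . '<table><tr><td>B</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$tables = $dom->getElementsByTagName('table');

$cellTexts = [];
foreach ($tables as $table) {
    // Search only inside the current table.
    foreach ($table->getElementsByTagName('td') as $cell) {
        $cellTexts[] = trim($cell->textContent);
    }
}

echo 'Found: ' . $tables->length . " tables\n";
echo 'Cells: ' . implode(', ', $cellTexts) . "\n";
```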

Example: For Getting Links

$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $node)
{
    echo $dom->saveHTML($node);
}

To get the text values of the node:

echo $node->nodeValue; 

To check if the href attribute exists:

var_dump($node->hasAttribute('href')); // bool(true) or bool(false)

To get the href attribute:

echo $node->getAttribute('href');

To change the href attribute:

$node->setAttribute('href', 'something else');

To remove the href attribute:

$node->removeAttribute('href');
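The attribute methods above are usually combined inside one loop. The sketch below, built on made-up markup and a hypothetical base URL in $base, skips anchors without an href and turns root-relative links into absolute ones.

```php
<?php
// Hypothetical sample markup: one relative link, one anchor without href,
// and one link that is already absolute.
$html = '<html><body>'
      . '<a href="/about">About</a>'
      . '<a name="top">Top</a>'
      . '<a href="https://example.com/">External</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$base = 'https://example.org'; // assumed site root, for illustration only
$hrefs = [];

foreach ($dom->getElementsByTagName('a') as $link) {
    if (!$link->hasAttribute('href')) {
        continue; // nothing to rewrite on this anchor
    }
    $href = $link->getAttribute('href');
    if (strpos($href, '/') === 0) {
        // Root-relative link: prefix it with the base URL.
        $link->setAttribute('href', $base . $href);
    }
    $hrefs[] = $link->getAttribute('href');
}

echo implode("\n", $hrefs) . "\n";
```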

Get Element by Id

The getElementById method returns the DOMElement with the given id, or NULL if no such element is found. It is useful when you want to read the content or an attribute value of an HTML element with a specific id.

$element = $dom->getElementById('myid');
// getElementById returns NULL when no element has this id
if ($element !== null) {
    echo $element->nodeValue;
}
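Because a missing id yields NULL, it helps to see both outcomes side by side. This is a minimal sketch with an invented one-line document; the id no-such-id is deliberately absent.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="myid">Hello</div></body></html>');

$found = $dom->getElementById('myid');
$missing = $dom->getElementById('no-such-id');

// Guard every lookup before reading properties from the result.
$foundText = $found !== null ? $found->nodeValue : '(not found)';
$missingText = $missing !== null ? $missing->nodeValue : '(not found)';

echo "myid: $foundText\n";
echo "no-such-id: $missingText\n";
```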

DOMXPath in PHP

The DOMXPath class is part of the PHP DOM extension. XPath uses path expressions to select nodes.

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

Syntax for XPath Query

  • / Selects from the root node
  • // Selects nodes in the document from the current node that match the selection no matter where they are
  • . Selects the current node
  • .. Selects the parent of the current node
  • @ Selects attributes
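Each of the path operators listed above can be tried against a small document. The following sketch uses invented markup; the second argument to query is the context node that relative expressions such as . and .. start from.

```php
<?php
$html = '<html><head><title>T</title></head>'
      . '<body><div id="main"><p>One</p><p>Two</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "/" : absolute path from the root node.
$title = $xpath->query('/html/head/title')->item(0)->nodeValue;

// "//" : matching nodes anywhere in the document.
$paragraphs = $xpath->query('//p')->length;

$div = $xpath->query('//div')->item(0);

// "." : the current (context) node, here the div.
$relative = $xpath->query('./p', $div)->length;

// ".." : the parent of the context node.
$parent = $xpath->query('..', $div)->item(0)->nodeName;

// "@" : an attribute of the selected element.
$id = $xpath->query('//div/@id')->item(0)->nodeValue;

echo "title=$title paragraphs=$paragraphs relative=$relative parent=$parent id=$id\n";
```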

Parse h1 tag text

$heading1 = '';
$contents = $xpath->query('//h1');
// query() returns false (not null) when the expression is invalid
if ($contents !== false)
{
    foreach ($contents as $node) {
        $heading1 .= ' ' . $node->nodeValue;
    }
}
echo("h1: $heading1\n\n");

Parse h3 and h4 tag text

$heading3and4 = '';
$contents = $xpath->query('//h3 | //h4');
// query() returns false (not null) when the expression is invalid
if ($contents !== false)
{
    foreach ($contents as $node) {
        $heading3and4 .= ' ' . $node->nodeValue;
    }
}
echo("h3 and h4s: $heading3and4\n\n");

Parse meta description

$metaDescription = '';
$contents = $xpath->query('/html/head/meta[@name="description"]/@content');
if ($contents->length != 0)
{
foreach ($contents as $content) {
$metaDescription .= $content->value;
}
}
echo("Meta Description: $metaDescription\n\n");

Parse meta keywords

$metaKeywords = '';
$contents = $xpath->query('/html/head/meta[@name="keywords"]/@content');
if ($contents->length != 0)
{
    foreach ($contents as $content) {
        $metaKeywords .= ' ' . $content->value;
    }
}
echo("Meta Keywords: $metaKeywords\n\n");
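When only a single string is needed, the meta lookups can be written more compactly. This sketch swaps query for DOMXPath::evaluate with the XPath string() function, which returns the value of the first match or an empty string when nothing matches. The markup and the meta name author are assumptions made for illustration.

```php
<?php
// Hypothetical sample head section.
$html = '<html><head>'
      . '<meta name="description" content="A sample page">'
      . '<meta name="keywords" content="php, dom">'
      . '</head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// string() converts the first matching attribute node to plain text.
$description = $xpath->evaluate('string(/html/head/meta[@name="description"]/@content)');
$keywords = $xpath->evaluate('string(/html/head/meta[@name="keywords"]/@content)');

// No such meta tag exists, so this evaluates to an empty string.
$author = $xpath->evaluate('string(/html/head/meta[@name="author"]/@content)');

echo "Meta Description: $description\n";
echo "Meta Keywords: $keywords\n";
echo 'Meta Author: ' . ($author === '' ? '(none)' : $author) . "\n";
```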

Parse Elements with class Name

$nodeList = $xpath->query("//div[@class='class_name']");
// item(0) returns NULL when nothing matched
$node = $nodeList->item(0);

// To check the result:
if ($node !== null) {
    echo "<p>" . $node->nodeValue . "</p>";
}
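The expression @class='class_name' compares the whole attribute value, so it misses elements that carry several classes. A common workaround, sketched here on made-up markup, pads the class list with spaces and searches for the class name as a separate token.

```php
<?php
// Hypothetical sample markup: an exact match, a multi-class match,
// and a similar-looking class that must not match.
$html = '<html><body>'
      . '<div class="class_name">exact</div>'
      . '<div class="box class_name wide">multi</div>'
      . '<div class="class_name_other">other</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Matches only when the attribute is exactly "class_name".
$exact = $xpath->query("//div[@class='class_name']");

// Matches "class_name" as one token among several classes.
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' class_name ')]"
);

$texts = [];
foreach ($token as $node) {
    $texts[] = $node->nodeValue;
}

echo 'Exact matches: ' . $exact->length . "\n";
echo 'Token matches: ' . implode(', ', $texts) . "\n";
```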