E-commerce product migration with DOMDocument()

16 Jan 2018

When building a new e-commerce site to replace an existing one I usually try to get hold of a copy of the database in use, so that I can save the client significant set-up time by simply mapping products, categories and so on across to the new database. If the agency in question is generally helpful that is rarely a problem, and it is my preferred solution. Sometimes however the client has had a poor experience with their current agency, and/or the agency in question is simply unhelpful, and on occasion deliberately obstructive. Sadly it does happen, and I have even come across cases where the current agency have suspected that the client will be going elsewhere and simply taken the existing site offline with no warning. I've been working around one such situation recently, in which the client was potentially faced with the extended workload of having to recreate many thousands of products in the new site. Not ideal. With no access to the database a different approach was required... which is where PHP's Document Object Model (DOM) comes in.

The PHP DOM provides a very handy API for operating on structured XML/HTML documents, and given that the product pages on the existing site all used a common template with identifiable nodes for the various key product parameters, therein lay the solution to saving the client hundreds of hours of tedious effort. This case has a lot in common with a number of development situations so I figured I would share my solution here such that you can take from it what you will.

I'd already culled the product category structure from the existing site so what follows deals with the products themselves together with images, and any assignments to those categories.

My e-commerce platform is built upon the CodeIgniter 3 framework so the scripts are presented in the context of a CI controller that is part of the build in question, but of course they are easily adapted to any other context and really are just a bit of procedural code. Using a controller was just easier when it came to doing all the necessary stuff to map the harvested data into the local site under development.

The solution assumes that the site being examined has a proper XML sitemap. If it doesn't then some sort of recursive function starting from the category menu would be a good place to start in terms of harvesting all the site URLs.
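If there is no sitemap, the building block for that recursive approach is a function that pulls the same-site links out of one page. The following is only a rough sketch under assumptions: harvestLinks() is a hypothetical helper that is not part of my controller, it only handles absolute and root-relative hrefs, and the sample markup is invented for illustration.

```php
<?php
// Hypothetical helper (not part of the original controller): collect the
// same-site links found in one page of HTML so they can be queued for crawling.
function harvestLinks(string $html, string $baseurl): array
{
    if ($html === '') {
        return array(); // loadHTML() refuses an empty string
    }

    $base = rtrim($baseurl, '/');

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate HTML5 / sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // skip empty hrefs and in-page anchors
        }
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $href = $base . $href; // make root-relative links absolute
        }
        if ($href === $base || stripos($href, $base . '/') === 0) {
            $links[$href] = true; // array keys de-duplicate for us
        }
    }
    return array_keys($links);
}

// Invented category menu markup, purely for illustration.
$html = '<ul><li><a href="/_shop/widgets">Widgets</a></li>'
      . '<li><a href="http://www.somesite.com/_shop/gadgets">Gadgets</a></li>'
      . '<li><a href="/_shop/widgets">Widgets again</a></li>'
      . '<li><a href="http://elsewhere.example/">External</a></li></ul>';

print_r(harvestLinks($html, 'http://www.somesite.com'));
```

To crawl a whole site you would call this on each fetched page, keep a list of URLs already visited, and stop when no new ones turn up.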

I haven't really included anything by way of error handling since this controller only gets called manually by me and I'm interested only in its utility, but it could easily be turned into a tool with a nice user-interface and so on.

It all worked well and in the matter of a few minutes successfully recreated thousands of products in the local site. Huge timesaver.

While you're at it you can use the same approach to write all the 301 redirects you'll need to map all the old product URLs to the new ones, ready for when the new site goes live.
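As a rough illustration of the redirect idea, here is a minimal sketch that turns a map of old paths to new paths into Apache "Redirect 301" lines. It assumes the new site runs on Apache with mod_alias; buildRedirects(), the example paths and the newsite.example domain are all made up for the example.

```php
<?php
// Hypothetical helper: turn an old-path => new-path map into Apache
// mod_alias "Redirect 301" lines suitable for pasting into an .htaccess file.
function buildRedirects(array $map, string $newBase): array
{
    $lines = array();
    foreach ($map as $oldPath => $newPath) {
        $lines[] = 'Redirect 301 ' . $oldPath . ' ' . rtrim($newBase, '/') . $newPath;
    }
    return $lines;
}

// Invented paths: in practice the map would be built while saving each product.
$map = array(
    '/_shop/blue-widget' => '/products/blue-widget',
    '/_shop/red-widget'  => '/products/red-widget',
);

echo implode("\n", buildRedirects($map, 'https://www.newsite.example')), "\n";
```

Bear in mind that Redirect matches on the URL path only, so old URLs that differ only by query string would need a mod_rewrite rule instead.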

 

1. The basic controller + index function.


class Get_products extends Site_Controller 
{

    private $baseurl = 'http://www.somesite.com'; //the baseurl of the site being examined


    function index()
    {
        ini_set('memory_limit', '-1');
        set_time_limit(0);

        $sitemap = $this->baseurl.'/sitemap.xml';

        //get all the urls
        $urls = $this->parseSitemap($sitemap);

        if(!empty($urls)) {

            echo 'Processing '.count($urls).' urls...';

            $n = 0;

            foreach ($urls as $url) {
                if($this->parseProduct($url))
                {
                    $n++;
                } 
            }
            echo $n.' products were successfully processed.';
        }
        else {
            echo 'No urls were found';
        }
       return;
    }



    //the functions in sections 2 to 8 below are methods of this same class

}

 

2. Parse the Sitemap

    function parseSitemap($sitemap)
    {
        /** This function simply gets all the URLs in the sitemap. Assuming the sitemap is structured correctly, URLs are wrapped by the loc tag.
        *  In this case all product URLs contain the string '_shop' so I am ignoring any that don't.
        *  It's not critical since ultimately the product page structure is used to determine if the URL is a product or not, but this just saves a bit of overhead.
        */
        
        $urls = array(); 
        $DomDocument = new DOMDocument(); 
        $DomDocument->preserveWhiteSpace = false;
        $DomDocument->load($sitemap); 
        $DomNodeList = $DomDocument->getElementsByTagName('loc'); 

        foreach($DomNodeList as $url) { 
            if(stripos($url->nodeValue, '_shop') !== false) {
                $urls[] = $url->nodeValue; 
            }
        }
        return $urls;
    }
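One thing to watch for: larger sites sometimes publish a sitemap index, where each loc points at a further sitemap rather than at a page. The sketch below is an assumption about how you might detect that, not something the site I was examining needed; readSitemap() is a hypothetical helper that works on an XML string so it can be tried without a network request.

```php
<?php
// Hypothetical helper: read sitemap XML from a string and report whether it is
// a sitemap index (a list of further sitemaps) or a plain urlset of page URLs.
function readSitemap(string $xml): array
{
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;

    // loadXML() refuses an empty string and returns false on malformed XML.
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return array('type' => null, 'locs' => array());
    }

    $locs = array();
    foreach ($dom->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->nodeValue);
    }

    // The root element is 'sitemapindex' for an index and 'urlset' for a normal sitemap.
    return array('type' => $dom->documentElement->localName, 'locs' => $locs);
}

$index = '<?xml version="1.0" encoding="UTF-8"?>'
       . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       . '<sitemap><loc>http://www.somesite.com/sitemap-products.xml</loc></sitemap>'
       . '<sitemap><loc>http://www.somesite.com/sitemap-pages.xml</loc></sitemap>'
       . '</sitemapindex>';

print_r(readSitemap($index));
```

If the type comes back as 'sitemapindex', each loc would be loaded in turn and passed through the same filtering as in parseSitemap().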

 

3. Parse the product

This function does the work of picking through a retrieved product page. It includes calls to a number of helper functions which are reproduced with explanations below this one.

function parseProduct($url)
    {

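        $html = $this->fetch_html($url);

        if(empty($html)) {
            return false; //nothing came back, so there is nothing to parse
        }

        $dom = new DOMDocument();

        $dom->preserveWhiteSpace = false; //set before loading; it has no effect afterwards

        libxml_use_internal_errors(true); //HTML5 elements and sloppy markup cause parser errors on load, this will suppress those.

        $dom->loadHTML($html);
        
        libxml_clear_errors();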

        /**
         * In this case the product page structure uses an h1 tag for the product title.
         *     If no title is found then ignore the URL as it's not a product.
         *  Call to helper function elementByClass() to search the DOM for the appropriate element 
         */

        $className = 'product-title';
        $tagName = 'h1';
        $element = $this->elementByClass($dom, $tagName, $className);

        if($element !== false) {
            
            $productTitle = $element->nodeValue;

            /**
             * Subsequent product parameters can be discovered using the same method based on tag and class.
             * I'd already retrieved all the category names in use by the site so grabbing the product category assignment also so I can set up categories.
             * In this case the existing site had a 1 to 1 relationship between products and categories. If dealing with a one to many then if the sitemap has unique URLs
             * then simply look for duplicate products in the function saveProduct() and do category assignments as appropriate (assuming your new site can handle a one to many relationship).
             */

            // Look for a category name

            $className = 'detailProductCat';
            $tagName = 'div';
            $element = $this->elementByClass($dom, $tagName, $className);

            if($element !== false) {
                $productCategory = $element->nodeValue;
            }
            else {
                $productCategory = null;
            }

            // And for a product description

            $className = 'detailProductDesc';
            $tagName = 'div';
            $element = $this->elementByClass($dom, $tagName, $className);

            if($element !== false) {
                // strip_empty_paras() is a helper that isn't reproduced in this post
                $productDescription = strip_empty_paras($this->innerHTML($element));
            }
            else {
                $productDescription = null;
            }

            // Now find a price. In this case the site being analyzed didn't permit different prices for the various options on a given product.

            $element = $dom->getElementsByTagName('h2')->item(0); //item() returns null, not false, when there is no match
            if($element !== null) {
                $productPrice = preg_replace('/[^0-9.]/','',$element->nodeValue);
            }
            else {
                $productPrice = null;
            }

            /** Product Options
            * The site being analyzed used a <select> to present different variations of a given product.
            * So if the product has options find those by finding the select and iterating over the select options.
            * If no <select> is found then it must be a single product with no choices.
            */

            $className = 'cartDdlOptions';
            $tagName = 'select';
            $element = $this->elementByClass($dom, $tagName, $className);

            
            $productOptions = array();
            if($element !== false) {
                $options = $element->getElementsByTagName('option');
                foreach ($options as $option) {
                    $productOptions[] = $option->nodeValue;
                }
            }
            
            /** PRODUCT IMAGES use the same philosophy. In this case the site used a carousel plugin so it was easy to identify the appropriate classname.
            * Images are copied to a local directory for later use.
            * In this case the source site generated image srcs dynamically so typically an image source could look like "/_loadimage.aspx?ID=172236"
            * so the following includes a call to a function that looks in the headers sent to determine the image type to save as.
            */
    

            $className = 'cycle-slide';
            $tagName = 'div';
            $element = $this->elementByClass($dom, $tagName, $className);
            $imagePaths = array();

            if($element !== false) {
                $images = $element->getElementsByTagName('img');
                $i = 0;
                $savePath = 'imagesTemp/';

                foreach ($images as $image) {
                    $src = $this->baseurl.$image->getAttribute('src');

                    //get the file contents

                    $imageString = file_get_contents($src);  

                    if($imageString !== false) {
                        //and work out the file type. Only interested in jpg, gif, or png in this case.

                        $type = $this->find_file_type($src);

                        if($type == 'gif' || $type == 'jpg' || $type == 'jpeg' || $type == 'png') {
                            $ext = str_replace('e', '', $type); //I know jpeg is a valid extension but I don't like it...

                            //save the file with a nice, SEO friendly filename. CodeIgniter has a handy helper function, url_title(), that does a good job of cleaning up strings for URLs.

                            $filePath = $savePath.url_title($productTitle).'-'.$i.'.'.$ext;
                            $save = file_put_contents($filePath,$imageString);
                            if($save !== false) {
                                $imagePaths[] = $filePath; //record the path actually written, with the real extension
                                $i++;
                            }
                        }
                    }
                }
            }

            $product = array(
                'productTitle' => $productTitle,
                'productCategory' => $productCategory,
                'productDescription' => $productDescription,
                'productPrice' => $productPrice,
                'productOptions' => $productOptions,
                'imagePaths' => $imagePaths
            );

            // Pass the product data to the saveProduct function that does whatever your own e-commerce platform needs in terms of database and file structure.
            return $this->saveProduct($product);
            
        }
        
        return false;

    }

4. Get the HTML

Simple cURL request to fetch the HTML for a given URL

function fetch_html($url)
    {
        

        $ch = curl_init();
        $timeout = 5;
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $html = curl_exec($ch);
        curl_close($ch);

        return $html;
    }

 

5. Find elements by class

The site being examined used specific classes to identify key areas of markup in the product template. We need to grab those to get to the product parameters. PHP's DOMDocument doesn't include a direct means of accessing nodes by classname so this function takes care of that.

function elementByClass(&$domParent, $tagName, $className)
    {
        /** PHP's DOMDocument class doesn't include a direct means of identifying nodes by classname.
        * But you can iterate over the elements with a given tag looking for the appropriate class attribute.
        * Note that stripos() is a case-insensitive substring match, so 'product-title' would also match 'product-title-small'.
        * I only want the first instance but have structured the function to provide an array of nodes should it be needed.
        */ 

        $nodes = array();

        $childNodes = $domParent->getElementsByTagName($tagName);

        foreach ($childNodes as $node) {
            if (stripos($node->getAttribute('class'), $className) !== FALSE) {
                $nodes[] = $node;

                //you could just do this if you only ever want the first node.
                //return $node;
            }
        }
        }
           
           //in this case I just want the first

           if(!empty($nodes[0]))  {
               return $nodes[0];
           }
           else {
               return false;
           }
           
    }
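If substring matching on the class attribute is too loose for the markup you are dealing with, DOMXPath can match a whole class token instead. This is a minimal sketch of that alternative rather than what I actually ran; firstByClass() is a hypothetical name and the sample markup is invented. It assumes the tag and class names are trusted literals, since they are dropped straight into the XPath expression.

```php
<?php
// Hypothetical alternative to elementByClass(): match a whole class token with XPath.
function firstByClass(DOMDocument $dom, string $tagName, string $className)
{
    $xpath = new DOMXPath($dom);

    // Padding the class attribute with spaces means ' product-title ' only
    // matches that exact token, not 'product-title-small'.
    $query = sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tagName,
        $className
    );

    $nodes = $xpath->query($query);

    return ($nodes !== false && $nodes->length > 0) ? $nodes->item(0) : false;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<h1 class="product-title-small">Teaser</h1><h1 class="big product-title">Blue Widget</h1>');
libxml_clear_errors();

echo firstByClass($dom, 'h1', 'product-title')->nodeValue, "\n"; // prints "Blue Widget"
```

It returns false when nothing matches, so it could be swapped in for elementByClass() without changing the calling code.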

 

6. Inner HTML

The product description exists over multiple paragraphs with tags I was keen to preserve so this helper function does just that.

function innerHTML( $parentNode )
    {
        /* Neat helper function extracts the inner HTML of a DOM node
        * credit to https://kuttler.eu/en/post/php-innerhtml/  for saving me time
         */

        $innerHTML = '';
        $elements = $parentNode->childNodes;

        foreach( $elements as $element ) { 
            if ( $element->nodeType == XML_TEXT_NODE ) {
                $text = $element->nodeValue;
                $innerHTML .= $text;
            }     
            elseif ( $element->nodeType == XML_COMMENT_NODE ) {
                $innerHTML .= '';
            }     
            else {
                $innerHTML .= '<';
                $innerHTML .= $element->nodeName;
                if ( $element->hasAttributes() ) { 
                    $attributes = $element->attributes;
                    foreach ( $attributes as $attribute )
                        $innerHTML .= " {$attribute->nodeName}='{$attribute->nodeValue}'" ;
                }     
                $innerHTML .= '>';
                $innerHTML .= $this->innerHTML( $element );
                $innerHTML .= "</{$element->nodeName}>";
            }     
        }     
        return $innerHTML;
    }
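For comparison, DOMDocument::saveHTML() accepts a node argument, so the inner HTML can also be built by serialising each child node in turn. This is just a sketch of that alternative with a hypothetical function name and invented sample markup. Unlike the hand-rolled version it leaves escaping and attribute quoting to libxml, so the exact output may differ slightly.

```php
<?php
// Hypothetical alternative to innerHTML(): let DOMDocument serialise each child.
function innerHtmlViaSave(DOMNode $parentNode): string
{
    $html = '';
    foreach ($parentNode->childNodes as $child) {
        // saveHTML() with a node argument dumps just that node and its descendants.
        $html .= $parentNode->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<div class="detailProductDesc"><p>First <b>bold</b></p><p>Second</p></div>');
libxml_clear_errors();

$div = $dom->getElementsByTagName('div')->item(0);
echo innerHtmlViaSave($div), "\n";
```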

 

7.  Image file types

Browsers use the content-type header rather than file extension to determine the type of image file being served. In this case because the site under examination served image data dynamically it's necessary to know what the image type is such that the image can be copied and saved correctly. This function does a simple examination of the headers served from the image src.

function find_file_type($image_src)
    {
        /* Browsers use the content-type header to understand if something is an image and what kind it is.
        * This function simply uses PHP's built-in get_headers() to get the headers returned at the image src url and returns the type if it's an image.
        */

        $headers = get_headers($image_src);
        
        if(!empty($headers)) {
            foreach ($headers as $h) {
                //just looking for an "image/*" string
                if(stripos($h, 'image/') !== FALSE)
                {
                    $dat = array();
                    //extract the type substring, whether or not parameters follow it (e.g. "image/jpeg" or "image/jpeg; charset=utf-8")
                    preg_match('/image\/([a-z0-9.+-]+)/i', $h, $dat);
                    if(!empty($dat[1])) {
                        return strtolower($dat[1]);
                    }
                }
            }
        }

        return false;
    }
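Since parseProduct() has already downloaded the image data with file_get_contents(), another option is to work out the type from the bytes themselves, which avoids the second request that get_headers() makes. The following is a sketch of that idea only; extensionFromBytes() is a hypothetical helper and the test data is a hand-built PNG header rather than a real image.

```php
<?php
// Hypothetical helper: work out a file extension from image data already in memory.
function extensionFromBytes(string $bytes)
{
    // getimagesizefromstring() returns false when the data is not a recognised image.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return false;
    }

    // Only interested in jpg, gif or png, as in parseProduct().
    $map = array('image/jpeg' => 'jpg', 'image/png' => 'png', 'image/gif' => 'gif');

    return isset($map[$info['mime']]) ? $map[$info['mime']] : false;
}

// Just enough of a PNG (signature plus the IHDR header of a 1x1 image) to be identified.
$pngHeader = "\x89PNG\r\n\x1a\n" . pack('N', 13) . 'IHDR' . pack('NN', 1, 1) . "\x08\x06\x00\x00\x00";

var_dump(extensionFromBytes($pngHeader));
var_dump(extensionFromBytes(str_repeat('x', 64)));
```

In parseProduct() this would take $imageString in place of the call to find_file_type($src).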

 

8. Save Product

Just whatever you need to do here...

function saveProduct($product)
    {

        /* This function contains whatever you need to do to create the product in the context of your own site.
        * In my case that means various database operations around products, product options and category assignments, plus setting up the file structure for the product images.
        * It should return true on success, since index() only counts a product when parseProduct() returns a truthy value.
        * For the record my e-commerce platform maintains product images in separate folders for each product; it makes user management of them much simpler than having a single repository with thousands of pictures.
        .
        .
        .
        .
        */

    }