Save time by automating CMS Population with Selenium and Python

alt_text

If you’ve got a Content Management System and a lot of content to populate it with, as some large companies might, it could take weeks just to do one site. Even generating just the page structure might take a day. Automatically creating the structure and filling each page with the appropriate text content, while not perfect, might save you a lot of time!

In order to automate this, normally you\’ll have an excel sheet for each website you need to populate. When I\’ve done this the excel sheet has a special style to represent the page structure, where red is a root page, light red is a sub page, and so on.
The script we produce to automate this data entry obviously will be different for every CMS frontend, but the process will probably be similar.

This is assuming you don’t have access to an API, in which case you’d just send appropriate requests using the Python requests package.

Parsing the Excel Sheet

I’m going to use the Openpyxl library to read the excel sheet:

from openpyxl import load_workbook
file = 'Spanish Website Translation.xlsx'
wb = load_workbook(filename = file)
ws = wb.active

We load the workbook, and assuming there\’s only one worksheet, set the current worksheet. Looking at the sheet, each page level in the structure has its own style. Check individual cell styles with

ws['A1'].fill.fill_type # Returns 'solid', 'pattern', None etc
ws['A1'].fill.start_color.index # Returns hex val of cell colour
ws['A1'].fill.start_color.tint # Returns tint value as float

Once we can map each style to a page level, just write a function to take a cell, parse its style and return a page level. In this case I\’ll return a string

def find_style(cell):
    if cell.fill.fill_type == 'solid':
        if cell.fill.start_color.index == 'FFFF554D':
            return {'err':0, 'text' : RootPage}
        elif cell.fill.start_color.tint == -0.3499862666707358:
            return {'err':0, 'text' : Page}
        elif cell.fill.start_color.tint == -0.1499984740745262            
            return {'err':0, 'text' : SubPage}
        elif cell.fill.start_color.index == 'FFFFB9AC':            
            return {'err':0, 'text' : SubSubPage}
        else:
            return {'err':1,
                    'text' : Style not recognised row %i %s\n % (cell.row, cell.fill)
                   }
    elif cell.fill.fill_type == None:
        return {'err':0, 'text' : Text}
    else:
        return {'err':1,
                'text' : Style not recognised row %i %s\n % (cell.row, cell.fill.fill_type)
               }

This is an example of a function returning different page stypes given a cell. Note an error is returned too, we need to be able to parse every style in the excel sheet, so we can use this to find any styles we may have missed. One way is to go through every row and print the message if there\’s an error.

CMS Frontent Manipulation

So now we\’ve got a pagestyle parser, let\’s do some CMS frontend manipulation!

def log_in(driver):
driver.find_element_by_id('user_session_email').send_keys('youremail')    driver.find_element_by_id('user_session_password').send_keys('password')
    btn = driver.find_element_by_class_name('btn')
    btn.click()

This is a pretty simple log in function in Selenium, I\’ve discussed this already in my Automating Web Download Tasks series.

def create_page(driver, page_number, title):
    url = https://yourCMDFrontent.com/admin/pages/%i/new % page_number
    driver.get(url)
    new_page = fill_new_page_form(driver, title)
    return new_page

Now the CMS frontend I was working with had a URL to create a new page under a given page number. Each page had a unique page number, which incremented each time a new one was made. This function creates a new page under the page we pass it. When we go to this URL we\’re presented with a form to fill in the page title and other info, so let\’s just refer to a function we\’ve not created yet to fill that page and click the update button. This function also wants to return your newly created page number so you can create another page under that if needed.

That was a bit of a mouthful! But let\’s just imagine this whole structure as a tree, starting at the root page (your landing page) where each branch (page) can have many sub-branches (sub-pages). The fill_new_page_form will just take our translated title, fill in the required form inputs, and hit submit. Now that submit will redirect us to the page’s admin page, which contains the new page number, which we need to keep.

def fill_new_page_form(driver, title):
    driver.find_element_by_id('page_title').send_keys(title)
    driver.find_element_by_id('page_menu_title').send_keys(title)
    driver.find_element_by_name('button').click()
    page_no = int(str.split(driver.current_url, '/')[5])
    return page_no

This is again, fairly straightforward, we fill in a few form inputs and click submit. Then get the redirected URL containing the new page number. If you’re still confused with how you use Selenium with Python I suggest you read my short series and the docs.

In my excel sheet if the cell has no style, it’s just content for the preceeding page so I need to stuff it in a text box. For the time being let’s just stub this as add_text_to_page that doesn’t do anything.

Page structure logic

As always, let’s divide and conquer this problem, how do we loop through the excel sheet and create the right pages and sub pages? We’ve got to keep track of which page number we’re at on each struture level. Now because I know I’ve got 4 layers maximum in this website I’m hard-coding with these. A more general solution would be to use an array, and push and pop values where necessary.

Let’s just try to create the root pages first. I.e the ones at the top level. These are pages like:

  • Home
  • About
  • Contact us

Given the site "page number" is 92. This is what all the root pages must be appended to, we’ll hard code it. The sp stand for the spanish translation website.

sp_root = 92
root  = 0
current_page = 0
text = []

for cell in ws['A1:A100']: 
    ret = find_style(cell[0]) # Get cell style and map to page type
    if ret['err']: # If this 'err' flag is set, then the style wasn't recognised and we can't continue
        raise Exception(ret['text'])
    else:
        #Found Root, add page to the website root page number, in this case hard coded
         if ret['text'] == 'Root':

                #Update our root page number
                root = create_page(driver, sp_root, ws['B%i'%cell[0].row].value)

                #Set current page number, this is so we can add text to the current page when we get it
                current_page = root
        elif ret['text'] == 'Text':
            #Add to text array, adds text to page when new section is found
            text.append(ws['B%i'%cell[0].row].value)

This is a bit to take in, first we set the parameters root and current_page to 0. Then we loop through the first 100 rows in the sheet. If we find a "root" style then we create a page with its name under the hardcoded root.

If the style is text, bearing in mind my excel sheet had mutiple text rows each containing a paragraph, then append it to an array for use later.

Now this should create our root pages, but it doesn’t do anything with our text yet. How do we handle this? Well I think the logic should go like this:

  1. Create page, set as current_page
  2. Append next text rows to text array
  3. Once we find a new page style, add all the previous text to the page in 1 and empty the text array

So how does this look?

sp_root = 92
root  = 0
current_page = 0
text = []

for cell in ws['A1:A100']: 
    ret = find_style(cell[0]) # Get cell style and map to page type
    if ret['err']: # If this 'err' flag is set, then the style wasn't recognised and we can't continue
        raise Exception(ret['text'])
    else:
    # If the next thing isn't another text row
        if ret['tet'] != 'Text':
            # We need to first add all the text we've collected to the page we just created
            if len(text) > 0:
                add_text_to_page(driver,  current_page,  text)
                # And clear the text array
                text = []

            #Found Root, add page to the website root page number, in this case hard coded
             if ret['text'] == 'Root':

                #Update our root page number
                root = create_page(driver, sp_root, ws['B%i'%cell[0].row].value)

                #Set current page number, this is so we can add text to the current page when we get it
                current_page = root
            else:
                # TODO: Implement pages on lower level than the root

        elif ret['text'] == 'Text':
            #Add to text array, adds text to page when new section is found
            text.append(cell[0].value)

# If there's any text left, add it to the last page created
if len(text) > 0:
    add_text_to_page(driver,  current_page,  text)
    # And clear the text array
text = []

So this creates Root pages, and adds the required text to them! There’s a few issues where text might be added to a root page that should be in a lower level page, but this can be fixed when lower level pages are implemented.
How do we do this? It’s simple now we’ve got the general logic!

  1. Add a new variable, for example called sub_page
  2. Add another elif after the Root checker for another string, say ‘SubPage’ from the find_style function, which creates a new page given the root as the base page, sets current_page with it and the new sub_page variable.
  3. Text handling is automatically handled by our logic

Easy! Ask your questions below, or send a direct email to rowan@workyourtech.com

Leave a Reply

Your email address will not be published. Required fields are marked *