Python web scraping – Part 2

This is a two-part series on performing python web scraping. In our first part, Web Scraping with Python, we talk about the basics, ethics, and prerequisites of python web scraping. In this article, we will be talking about the programming side of scraping. So, let’s get straight into it.

NOTE: The entire code can be found on my GitHub page. The link is provided at the end of the article

Building the Web Scraper

The program has been written in Jupyter notebook.The following are steps to write the program:

  1. Import all the required library
  2. Create a new excel workbook
  3. Function to take in extracted data and enter it in the created excel sheet.
  4. Automate:  for each student, enter the ID, fetch the result, and parse the page to get the entire scorecard.
  5. Take arguments from the user and run the program

1.  Import all the required library:

import requests 
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import openpyxl 
import argparse

2. Create a new excel workbook

Here we are creating a function that will create a new workbook, rename the active sheet, insert the required headers, and save it in a path specified by us.

The path is sent to this function through the ‘f_loc’ variable from the command line argument

# Function named crtnewwb- will be called once.
def crtnewwb(f_loc):
    # Calling a Workbook() function of openpyx to create a new blank Workbook object and assigning it to variable f_wb 
    f_wb = openpyxl.Workbook()

    # Getting workbook's active sheet using the active attribute and assign it to a variable f_sheet
    f_sheet = 

    # changing the name of the sheet 
    f_sheet.title = "2nd PUC"

    # Adding the Column names
    f_sheet.cell(1,1).value='Reg Num'

    #Saving the file in the location provided by you 

3. Storing the extracted data to files

This function takes in the “row_num” (which makes sure each student entry is captured in a new row), all the extracted data (student’s registration number, name , marks of 6 subject ,total) ,file location and assigns them into the excel workbook. It data is saved each time this function is called

# Function to call the workbook and pass in extracted data
def inpt(f_loc,row_num,name,sub1,sub2,sub3,sub4,sub5,sub6,total,reg):
    #opening the workbook
    f_wb = openpyxl.load_workbook(f_loc)
    #selecting the sheet where we have to input data
    f_sheet = f_wb['2nd PUC']
    #inputting the data in to respective fields
    f_sheet.cell(row_num,1).value = reg
    f_sheet.cell(row_num,2).value = name
    f_sheet.cell(row_num,3).value = sub1
    f_sheet.cell(row_num,4).value = sub2
    f_sheet.cell(row_num,5).value = sub3
    f_sheet.cell(row_num,6).value = sub4
    f_sheet.cell(row_num,7).value = sub5
    f_sheet.cell(row_num,8).value = sub6
    f_sheet.cell(row_num,9).value = total
    #keeping a check as to how many rows are entered
    #saving the file

4. Python Web Scraping Automation

When the file path and the range of students registration number is passes to this function, it opens the web browser enter the Registration number, extracts the data and send it to the “inpt” function which saves the data. It is extraction of the data with the help of a beautiful soup package and the tags ‘reg’ and ‘td’ , that was identified earlier. There will be cases where students’s registration number does not exist due to which the progmean runs into error. To tackle such situations, the ‘try’ and ‘except’ error handling is used.

def automation(f_loc,inpt1,inpt2):
    #initializing chrome browser as selenium webdriver to variable browser 
    browser = webdriver.Chrome()
    #calling the function to create new workbook
    #initializing row =2 as the 1st roe in workbook are the heading
    rows = 2
    #fetching the HTML of the website
    #extracting data for the given range
    for i in range(int(inpt1),int(inpt2)):
            ele = browser.find_element_by_name('reg')
            soup = bs4.BeautifulSoup(browser.page_source,'lxml')
            req_data ='td')
            name = req_data[1].getText().strip()
            sub1 = req_data[7].getText().strip()
            sub2= req_data[11].getText().strip()
            sub3 =req_data[17].getText().strip()
            sub4 =req_data[21].getText().strip()
            sub5 =req_data[25].getText().strip()
            sub6 =req_data[29].getText().strip()
            gtm =req_data[33].getText().strip()

5. Running the script

This is the main program which allows the user to input the required field and then performs all the operations by calling the functions defined previously..

if __name__ == '__main__':
    #initializing variable parser to argument_parser 
    parser = argparse.ArgumentParser() 
    #adding 3 arguments: 2 for the range of Registration number and one for file path   
    parser.add_argument("-i1", "--input1", help = "file path of final workbook") 
    parser.add_argument("-i2", "--input2", help = "File to be automated ") 
    parser.add_argument("-fp", "--file_path", help = "File to be automated ") 
    args = parser.parse_args() 
    if args.input1 and args.input2 and args.file_path:
        print("Required fields not met")

6. Testing

With the program saved it can now be executed from command prompt .

Make sure your program and selenium driver are in the same location.

C:\Users\ashray\Desktop>python -i1 200000 -i2 200100 -fp “C:\Users\ashray\Desktop\test.xlsx”

Here -i1 and -i2 are the range of registration numbers we are extracting . -fp is the file path aong with the excel workbook name, where the extracted data is stored.

The output 

The output excel sheet look as seen in the image below

So with the help of python programming and the web-scraping technique, we are able to extract the student’s marks and thereby saving the lecturer’s time and effort.  

Here is the link to the full code on my Github Repositoty.

Ashray Pai

Mechanical Engineer | ML Enthusiast

Leave a Reply