This is the second article in a two-part series on web scraping with Python. In the first part, Web Scraping with Python, we covered the basics, ethics, and prerequisites of Python web scraping. In this article, we move on to the programming side of scraping. So, let's get straight into it.
NOTE: The entire code can be found on my GitHub page; the link is provided at the end of the article.
Building the Web Scraper
The program was written in a Jupyter notebook. The following are the steps to build it:
- Import all the required libraries
- Create a new Excel workbook
- Write a function that takes the extracted data and enters it into the created Excel sheet
- Automate: for each student, enter the ID, fetch the result, and parse the page to get the entire scorecard.
- Take arguments from the user and run the program
1. Import all the required libraries:
import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import openpyxl
import argparse
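If any of these packages are missing, they can be installed from the command prompt; the usual PyPI package names are shown below ('lxml' is the parser Beautiful Soup uses later in the script):
pip install requests beautifulsoup4 selenium openpyxl lxml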
2. Create a new Excel workbook
Here we are creating a function that will create a new workbook, rename the active sheet, insert the required headers, and save it in a path specified by us.
The path is passed to this function through the 'f_loc' parameter, which comes from a command-line argument.
# Function named crtnewwb - it will be called once
def crtnewwb(f_loc):
    # Calling the Workbook() function of openpyxl to create a new blank Workbook object and assigning it to the variable f_wb
    f_wb = openpyxl.Workbook()
    # Getting the workbook's active sheet using the active attribute and assigning it to the variable f_sheet
    f_sheet = f_wb.active
    # Changing the name of the sheet
    f_sheet.title = "2nd PUC"
    # Adding the column names
    f_sheet.cell(1,1).value = 'Reg Num'
    f_sheet.cell(1,2).value = 'Name'
    f_sheet.cell(1,3).value = 'Subject-1'
    f_sheet.cell(1,4).value = 'Subject-2'
    f_sheet.cell(1,5).value = 'Subject-3'
    f_sheet.cell(1,6).value = 'Subject-4'
    f_sheet.cell(1,7).value = 'Subject-5'
    f_sheet.cell(1,8).value = 'Subject-6'
    f_sheet.cell(1,9).value = 'Total'
    # Saving the file in the location provided by the user
    f_wb.save(f_loc)
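Before wiring this into the rest of the script, a quick sanity check is to call the function with a throwaway path and read the header row back (the file name below is just an example):
# Quick check of crtnewwb: create a test workbook and print the nine headers
crtnewwb('test_headers.xlsx')
chk_sheet = openpyxl.load_workbook('test_headers.xlsx')['2nd PUC']
print([chk_sheet.cell(1, col).value for col in range(1, 10)])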
3. Storing the extracted data in the workbook
This function takes in 'row_num' (which ensures each student's entry goes into a new row), all the extracted data (the student's registration number, name, marks for the 6 subjects, and total), and the file location, and writes them into the Excel workbook. The data is saved each time this function is called.
# Function to open the workbook and write in the extracted data
def inpt(f_loc,row_num,name,sub1,sub2,sub3,sub4,sub5,sub6,total,reg):
    # Opening the workbook
    f_wb = openpyxl.load_workbook(f_loc)
    # Selecting the sheet where we have to input data
    f_sheet = f_wb['2nd PUC']
    # Inputting the data into the respective fields
    f_sheet.cell(row_num,1).value = reg
    f_sheet.cell(row_num,2).value = name
    f_sheet.cell(row_num,3).value = sub1
    f_sheet.cell(row_num,4).value = sub2
    f_sheet.cell(row_num,5).value = sub3
    f_sheet.cell(row_num,6).value = sub4
    f_sheet.cell(row_num,7).value = sub5
    f_sheet.cell(row_num,8).value = sub6
    f_sheet.cell(row_num,9).value = total
    # Keeping track of how many rows have been entered
    print(row_num)
    # Saving the file
    f_wb.save(f_loc)
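As a quick standalone test, the function can be called with made-up marks and a hypothetical registration number; crtnewwb is called first so the headers and sheet exist:
# Example call with dummy data - row 2 is the first data row below the headings
crtnewwb('test_marks.xlsx')
inpt('test_marks.xlsx', 2, 'Test Student', 85, 90, 78, 88, 92, 81, 514, 200001)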
4. Python Web Scraping Automation
When the file path and the range of registration numbers are passed to this function, it opens the web browser, enters each registration number, extracts the data, and sends it to the 'inpt' function, which saves it. The extraction itself is done with the Beautiful Soup package, using the 'reg' and 'td' tags identified earlier. There will be cases where a student's registration number does not exist, which would cause the program to run into an error. To handle such situations, 'try' and 'except' error handling is used.
def automation(f_loc,inpt1,inpt2):
    # Initializing the Chrome browser as a Selenium webdriver and assigning it to the variable browser
    browser = webdriver.Chrome()
    # Calling the function to create a new workbook
    crtnewwb(f_loc)
    # Initializing rows = 2 as the 1st row in the workbook holds the headings
    rows = 2
    # Fetching the HTML of the website
    browser.get('http://karresults.nic.in/indexPUC_2020.asp')
    # Extracting data for the given range
    for i in range(int(inpt1),int(inpt2)):
        try:
            ele = browser.find_element_by_name('reg')
            ele.clear()
            ele.send_keys(str(i))
            ele.send_keys(Keys.RETURN)
            soup = bs4.BeautifulSoup(browser.page_source,'lxml')
            req_data = soup.select('td')
            name = req_data[1].getText().strip()
            sub1 = req_data[7].getText().strip()
            sub2 = req_data[11].getText().strip()
            sub3 = req_data[17].getText().strip()
            sub4 = req_data[21].getText().strip()
            sub5 = req_data[25].getText().strip()
            sub6 = req_data[29].getText().strip()
            gtm = req_data[33].getText().strip()
            inpt(f_loc,rows,name,sub1,sub2,sub3,sub4,sub5,sub6,gtm,i)
            rows += 1
            browser.back()
        except Exception:
            # Registration number not found (or page layout differs) - go back and try the next one
            browser.back()
            continue
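The hard-coded indices into req_data (1, 7, 11, ... 33) come from the layout of the result page inspected in the first part of this series. If the site's layout ever changes, or if you are adapting the script to a different results page, a small helper like the sketch below (the registration number is only a placeholder) can print every 'td' cell with its index so the positions can be re-checked:
# Helper sketch to re-check which <td> index holds which field
def show_td_cells(reg_num):
    browser = webdriver.Chrome()
    browser.get('http://karresults.nic.in/indexPUC_2020.asp')
    # Submit a single registration number, exactly as the automation function does
    ele = browser.find_element_by_name('reg')
    ele.send_keys(str(reg_num))
    ele.send_keys(Keys.RETURN)
    soup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    # Print every table cell along with its index
    for idx, cell in enumerate(soup.select('td')):
        print(idx, cell.getText().strip())
    browser.quit()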
5. Running the script
This is the main program, which takes the required fields from the user and then performs all the operations by calling the functions defined previously.
if __name__ == '__main__':
    # Initializing the variable parser as an ArgumentParser object
    parser = argparse.ArgumentParser()
    # Adding 3 arguments: 2 for the range of registration numbers and one for the file path
    parser.add_argument("-i1", "--input1", help = "start of the registration number range")
    parser.add_argument("-i2", "--input2", help = "end of the registration number range")
    parser.add_argument("-fp", "--file_path", help = "file path of the final workbook")
    args = parser.parse_args()
    if args.input1 and args.input2 and args.file_path:
        automation(args.file_path,args.input1,args.input2)
    else:
        print("Required fields not met")
6. Testing
With the program saved, it can now be executed from the command prompt.
Make sure the script and the Selenium ChromeDriver executable are in the same location.
C:\Users\ashray\Desktop>python Web_scraping.py -i1 200000 -i2 200100 -fp "C:\Users\ashray\Desktop\test.xlsx"
Here -i1 and -i2 define the range of registration numbers we are extracting, and -fp is the file path, along with the Excel workbook name, where the extracted data is stored.
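If the ChromeDriver executable is not next to the script (or on the system PATH), the driver path can also be passed explicitly when the browser is created, using the Selenium 3 style API this script relies on (the path below is only a placeholder):
browser = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')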
The output
The output Excel sheet looks as seen in the image below.
So, with the help of Python programming and web scraping, we are able to extract the students' marks, saving the lecturer's time and effort.
Here is the link to the full code on my GitHub repository.