Python web scraping – Part 2
This is the second article in a two-part series on Python web scraping. In the first part, Web Scraping with Python, we covered the basics, ethics, and prerequisites of web scraping with Python. In this article, we will be talking about the programming side of scraping. So, let's get straight into it.
NOTE: The entire code can be found on my GitHub page. The link is provided at the end of the article.
Building the Web Scraper
The program was written in a Jupyter notebook. The following are the steps to write the program:
- Import all the required libraries
- Create a new Excel workbook
- Write a function that takes the extracted data and enters it into the Excel sheet
- Automate: for each student, enter the ID, fetch the result, and parse the page to get the entire scorecard
- Take arguments from the user and run the program
1. Import all the required libraries:
import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import openpyxl
import argparse
2. Create a new Excel workbook
Here we are creating a function that will create a new workbook, rename the active sheet, insert the required headers, and save it in a path specified by us.
The path is passed to this function through the 'f_loc' variable, which comes from a command-line argument.
# Function named crtnewwb - it will be called once.
def crtnewwb(f_loc):
    # Calling the Workbook() function of openpyxl to create a new blank Workbook object
    # and assigning it to the variable f_wb
    f_wb = openpyxl.Workbook()
    # Getting the workbook's active sheet using the active attribute and assigning it to f_sheet
    f_sheet = f_wb.active
    # Changing the name of the sheet
    f_sheet.title = "2nd PUC"
    # Adding the column names
    f_sheet.cell(1,1).value = 'Reg Num'
    f_sheet.cell(1,2).value = 'Name'
    f_sheet.cell(1,3).value = 'Subject-1'
    f_sheet.cell(1,4).value = 'Subject-2'
    f_sheet.cell(1,5).value = 'Subject-3'
    f_sheet.cell(1,6).value = 'Subject-4'
    f_sheet.cell(1,7).value = 'Subject-5'
    f_sheet.cell(1,8).value = 'Subject-6'
    f_sheet.cell(1,9).value = 'Total'
    # Saving the file to the location provided by you
    f_wb.save(f_loc)
3. Storing the extracted data in the workbook
This function takes in 'row_num' (which ensures each student's entry is written to a new row), all the extracted data (the student's registration number, name, marks for six subjects, and total), and the file location, and writes them into the Excel workbook. The file is saved each time this function is called.
# Function to open the workbook and write the extracted data into it
def inpt(f_loc,row_num,name,sub1,sub2,sub3,sub4,sub5,sub6,total,reg):
    # Opening the workbook
    f_wb = openpyxl.load_workbook(f_loc)
    # Selecting the sheet where we have to input the data
    f_sheet = f_wb['2nd PUC']
    # Inputting the data into the respective fields
    f_sheet.cell(row_num,1).value = reg
    f_sheet.cell(row_num,2).value = name
    f_sheet.cell(row_num,3).value = sub1
    f_sheet.cell(row_num,4).value = sub2
    f_sheet.cell(row_num,5).value = sub3
    f_sheet.cell(row_num,6).value = sub4
    f_sheet.cell(row_num,7).value = sub5
    f_sheet.cell(row_num,8).value = sub6
    f_sheet.cell(row_num,9).value = total
    # Keeping a check on how many rows have been entered
    print(row_num)
    # Saving the file
    f_wb.save(f_loc)
4. Python Web Scraping Automation
When the file path and the range of registration numbers are passed to this function, it opens the web browser, enters each registration number, extracts the data, and sends it to the 'inpt' function, which saves it. The data is extracted with the help of the Beautiful Soup package and the 'reg' and 'td' tags that were identified earlier. There will be cases where a student's registration number does not exist, which causes the program to run into an error. To handle such situations, 'try' and 'except' error handling is used.
def automation(f_loc,inpt1,inpt2):
    # Initializing the Chrome browser as a Selenium webdriver and assigning it to the variable browser
    browser = webdriver.Chrome()
    # Calling the function to create a new workbook
    crtnewwb(f_loc)
    # Initializing rows = 2, as the 1st row in the workbook holds the headings
    rows = 2
    # Fetching the HTML of the website
    browser.get('http://karresults.nic.in/indexPUC_2020.asp')
    # Extracting data for the given range
    for i in range(int(inpt1),int(inpt2)):
        try:
            ele = browser.find_element_by_name('reg')
            ele.clear()
            ele.send_keys(str(i))
            ele.send_keys(Keys.RETURN)
            soup = bs4.BeautifulSoup(browser.page_source,'lxml')
            req_data = soup.select('td')
            name = req_data[1].getText().strip()
            sub1 = req_data[7].getText().strip()
            sub2 = req_data[11].getText().strip()
            sub3 = req_data[17].getText().strip()
            sub4 = req_data[21].getText().strip()
            sub5 = req_data[25].getText().strip()
            sub6 = req_data[29].getText().strip()
            gtm = req_data[33].getText().strip()
            inpt(f_loc,rows,name,sub1,sub2,sub3,sub4,sub5,sub6,gtm,i)
            rows += 1
            browser.back()
        except:
            # Registration number does not exist (or the page differs) - go back and try the next one
            browser.back()
            continue
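The hard-coded positions used above (req_data[1] for the name, req_data[7] through req_data[29] for the subject marks, req_data[33] for the total) come from inspecting the 'td' cells of the result page. If the page layout ever changes and those positions stop matching, a quick way to re-check them is to print every 'td' cell with its index. A minimal sketch, assuming a browser object that already has a result page loaded (as inside the try block above):

import bs4

# Sketch only: list every <td> cell on the currently loaded result page together
# with its index, so the positions used above can be verified or updated.
soup = bs4.BeautifulSoup(browser.page_source, 'lxml')
for idx, cell in enumerate(soup.select('td')):
    print(idx, cell.getText().strip())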
5. Running the script
This is the main program, which takes the required fields from the user and then performs all the operations by calling the functions defined previously.
if __name__ == '__main__':
    # Initializing the argument parser
    parser = argparse.ArgumentParser()
    # Adding 3 arguments: 2 for the range of registration numbers and one for the file path
    parser.add_argument("-i1", "--input1", help = "start of the registration number range")
    parser.add_argument("-i2", "--input2", help = "end of the registration number range")
    parser.add_argument("-fp", "--file_path", help = "file path of the output workbook")
    args = parser.parse_args()
    if args.input1 and args.input2 and args.file_path:
        automation(args.file_path,args.input1,args.input2)
    else:
        print("Required fields not met")
6. Testing
With the program saved, it can now be executed from the command prompt.
Make sure your program and the Selenium Chrome driver (chromedriver) are in the same location, or point Selenium at the driver explicitly, as in the sketch below.
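A minimal sketch of pointing Selenium at the driver explicitly, assuming Selenium 3.x (which matches the find_element_by_name call used above) and a purely hypothetical driver path:

from selenium import webdriver

# Sketch only: the path below is hypothetical - replace it with wherever your
# chromedriver executable actually lives.
driver_path = r"C:\webdrivers\chromedriver.exe"
browser = webdriver.Chrome(executable_path=driver_path)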
C:\Users\ashray\Desktop>python Web_scraping.py -i1 200000 -i2 200100 -fp "C:\Users\ashray\Desktop\test.xlsx"
Here, -i1 and -i2 define the range of registration numbers we are extracting, and -fp is the file path, along with the Excel workbook name, where the extracted data is stored.
The output
The output Excel sheet looks as shown in the image below.
So, with the help of Python programming and web-scraping techniques, we are able to extract the students' marks, thereby saving the lecturers' time and effort.
Here is the link to the full code on my GitHub repository.