Edward, one of our 11th grade students on the school’s yearbook team wanted to build an index of page numbers where each student appeared. Traditionally, this would entail multiple students going through every page and recording the page number for each student appearing on that page. Inevitably, some students get missed, errors are made, and some pages get skipped.
Edward, recognizing an opportunity to leverage his programming skills, wrote a Python script to collect the information in a much more efficient manner. His script reads two columns from a Google Spreadsheet into lists named name and nameraw. Then, for each name in the lists, the script searches each page of the yearbook pdf and logs each page number where the name is located. Finally, this data is written back to the Google sheet.
This script leverages Google sheets, but with minor alteration, it could easily use locally stored files instead. Note that this script depends on a number of dependencies, so don’t forget to add those that you may not yet have installed (PyPDF2, re, pathlib, and the relevant Google auth and client libraries). If you’re unfamiliar with the Google python libraries, well, that will need to be another Google search. However, if you use local files rather than reading from a Google sheet, those libraries (and the code between the dashed lines) could be excluded. You would just need to replace those lines with a couple others that would open your csv file with the list(s) of names.
This is Edward’s script and I had no hand in creating or using it. If you want to use it yourself, you’re welcome to do so. If you get stuck, check with one of your programming students or teachers first. If you’re still stuck, comment below and I’ll help out if I can.
from __future__ import print_function import os.path from PyPDF2 import PdfFileReader, PdfFileWriter from pathlib import Path import re import PyPDF2 #----------------------------------------------------- from google.auth.transport.requests import Request from google.oauth2.credentials import Credentials from googleapiclient.discovery import build from googleapiclient.errors import HttpError from google.oauth2 import service_account # Note that you will need to set up your own Google API credentials if # you're using Google Sheets. This part is beyond the scope of this post. # But you can find how to do this with a quick google search. SERVICE_ACCOUNT_FILE = 'keys.json' SCOPES = ['https://www.googleapis.com/auth/spreadsheets'] creds = None creds = service_account.Credentials.from_service_account_file( SERVICE_ACCOUNT_FILE, scopes=SCOPES) # The ID and range of the spreadsheet. Note that the ID below is fake # (we're not going to post the real one on the interwebs) # Make sure you update the ID to refer to your spreadsheet SPREADSHEET_ID = '1DQyjh1-7T9jjU7kmg6-TNcjsCtnCspWiMbv3t2mJeM' service = build('sheets', 'v4', credentials=creds) # Call the Sheets API sheet = service.spreadsheets() result = sheet.values().get(spreadsheetId=SPREADSHEET_ID, range="Sheet1!B1:B508").execute() result2 = sheet.values().get(spreadsheetId=SPREADSHEET_ID, range="Sheet1!A1:A508").execute() #------------------------------------------------------------ values = result.get('values') values2 = result2.get('values') name = [] nameraw = [] #put names from values and values2 into lists for x in values: for i in x: name.append(i) for x in values2: for i in x: nameraw.append(i) #Open the Yearbook pdf file pdf = PdfFileReader('yearbook1.pdf') #create a 2-D array for names & associated page numbers pagesfinal = [[]] #This is where everything happens. For each name, add it to pagesfinal. #Then open each page in the pdf file and search it for the current name. #If a version of the name is found, append the page number to the student's page list. for i in range(len(name)): print(nameraw[i]) pagesfinal.append([]) for page_num in range (pdf.numPages): pageobj = pdf.getPage(page_num) pageinfo = pageobj.extractText() pageinfo = ''.join(pageinfo.split()) name[i]= ''.join(str(name[i]).split()) nameraw[i] = ''.join(str(nameraw[i]).split()) if (re.search(name[i], pageinfo)) or (re.search(nameraw[i], pageinfo)) or (re.search(name[i].upper(), pageinfo)): pagesfinal[i].append(page_num +1) print(pagesfinal[i]) #When the above loop completes, the following line writes the data back to the Google sheet request = sheet.values().update(spreadsheetId=SPREADSHEET_ID, range="Sheet1!C1", valueInputOption = "USER_ENTERED", body={"values":pagesfinal}).execute()