How to Find the Page Number of a Text from a PDF File in Python?

What is PDF?

PDFs are a popular format for distributing text. PDF is an abbreviation for Portable Document Format, and it utilizes the .pdf file extension. Adobe Systems designed it in the early 1990s.

Reading PDF documents in Python can assist you in automating a wide range of operations.

PyPDF2 module:

PyPDF2 is a third-party Python library, so it is not part of the standard library and has to be installed separately. We can use it to count the pages of a PDF file and to extract the text of each page.

PyPDF2 can do more than text extraction. This library can also generate, decrypt, and merge PDF files.

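Before we can work with the PyPDF2 module, we must install it. The examples below use the PdfFileReader(), getNumPages(), getPage(), and extractText() names, which were removed in PyPDF2 3.0, so install a release older than 3.0 to run them as written.

Installation:

pip install "PyPDF2<3.0"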
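
To confirm which PyPDF2 release is installed before running the examples, a check along these lines may help. It is only a sketch: the helper names are hypothetical, and it assumes that the PdfFileReader-style names used in this article are usable only in releases older than 3.0.

```python
from importlib import metadata


def uses_legacy_names(version):
    """Return True if this PyPDF2 version string is older than 3.0."""
    major = int(version.split(".")[0])
    return major < 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pypdf2_version()
    if version is None:
        print("PyPDF2 is not installed")
    elif uses_legacy_names(version):
        print("PyPDF2", version, "still provides the PdfFileReader-style names")
    else:
        print("PyPDF2", version, "no longer provides the PdfFileReader-style names")
```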

Finding the Page Number of a Text from a PDF File in Python

Method #1: Using Built-in Functions (Static Input)

Approach:

  • Import PyPDF2 module using the import keyword
  • Import re module using the import keyword
  • Open the PDF file by passing its path as an argument to the PdfFileReader() function of the PyPDF2 module and store the resulting reader object in a variable.
  • Call the getNumPages() function on the reader object to get the total number of pages and store it in another variable.
  • Give the string as static input and store it in a variable.
  • Loop from the first page to the last page of the PDF using a for loop.
  • Pass the page number to the getPage() function of the reader object to get the page object for that page. Page numbers start at 0 here.
  • Call the extractText() function on the page object to get the text of the page.
  • Pass the given string and the page text to the search() function of the re module, which searches for the given string in the page text.
  • If a match is found, print the corresponding page number (the loop index plus 1, so that the first page is reported as page 1).
  • Exit the program.

Below is the implementation:

# Import PyPDF2 module using the import keyword
import PyPDF2
# Import re module using the import keyword
import re
# Open the PDF file by passing its path to the PdfFileReader() function
# of the PyPDF2 module and store the reader object in a variable.
pdf_object = PyPDF2.PdfFileReader(r"sample.pdf")
# Call getNumPages() on the reader object to get the total number of pages
# and store it in another variable.
pagenumbers = pdf_object.getNumPages()
# Give the string as static input and store it in a variable.
gvn_str = "watching"
# Loop till the end page of the pdf using the for loop.
for pageno in range(0, pagenumbers):
    # Pass the page number to the getPage() function of the reader object
    # to get the page object of the corresponding page.
    page_object = pdf_object.getPage(pageno)
    # Call extractText() on the page object to get the text from the page
    rslt_text = page_object.extractText()
    # Pass the given string and the page text to the search() function of the
    # re module, which searches for the given string in the page text
    if re.search(gvn_str, rslt_text):
        # If a match is found, print the corresponding page number
        print("The given string {", gvn_str, "} is in the page number: " + str(pageno + 1))

Output:

The given string { watching } is in the page number: 2
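
One thing to keep in mind about the search step: re.search() treats its first argument as a regular expression pattern. A search string that contains characters such as . or ( may therefore match unexpected text or raise re.error. The sketch below shows one possible way to match the string literally by escaping it with re.escape(). The helper name pages_with_text and the sample page texts are made up for illustration and stand in for the text returned by extractText().

```python
import re


def pages_with_text(page_texts, needle, ignore_case=False):
    """Return the 1-based page numbers whose text contains needle literally."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape() neutralises regex metacharacters such as "." and "("
    pattern = re.compile(re.escape(needle), flags)
    return [number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text)]


# Hypothetical page texts, one string per page
sample_pages = ["Total cost (USD): 4.50", "Nothing to see here", "Cost: 4x50"]

print(pages_with_text(sample_pages, "4.50"))        # prints [1]
print(pages_with_text(sample_pages, "cost", True))  # prints [1, 3]
```

Without re.escape(), the pattern "4.50" would also match "4x50" on the third page, because an unescaped dot matches any character.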

Method #2: Using Built-in Functions (User Input)

Approach:

  • Import PyPDF2 module using the import keyword
  • Import re module using the import keyword
  • Open the PDF file by passing its path as an argument to the PdfFileReader() function of the PyPDF2 module and store the resulting reader object in a variable.
  • Call the getNumPages() function on the reader object to get the total number of pages and store it in another variable.
  • Give the string as user input using the input() function and store it in a variable.
  • Loop from the first page to the last page of the PDF using a for loop.
  • Pass the page number to the getPage() function of the reader object to get the page object for that page. Page numbers start at 0 here.
  • Call the extractText() function on the page object to get the text of the page.
  • Pass the given string and the page text to the search() function of the re module, which searches for the given string in the page text.
  • If a match is found, print the corresponding page number (the loop index plus 1, so that the first page is reported as page 1).
  • Exit the program.

Below is the implementation:

# Import PyPDF2 module using the import keyword
import PyPDF2
# Import re module using the import keyword
import re
# Open the PDF file by passing its path to the PdfFileReader() function
# of the PyPDF2 module and store the reader object in a variable.
pdf_object = PyPDF2.PdfFileReader(r"sample.pdf")
# Call getNumPages() on the reader object to get the total number of pages
# and store it in another variable.
pagenumbers = pdf_object.getNumPages()
# Give the string as user input using the input() function and store it in a variable.
gvn_str = input("Enter some random string = ")
# Loop till the end page of the pdf using the for loop.
for pageno in range(0, pagenumbers):
    # Pass the page number to the getPage() function of the reader object
    # to get the page object of the corresponding page.
    page_object = pdf_object.getPage(pageno)
    # Call extractText() on the page object to get the text from the page
    rslt_text = page_object.extractText()
    # Pass the given string and the page text to the search() function of the
    # re module, which searches for the given string in the page text
    if re.search(gvn_str, rslt_text):
        # If a match is found, print the corresponding page number
        print("The given string {", gvn_str, "} is in the page number: " + str(pageno + 1))

Output:

Enter some random string = more
The given string { more } is in the page number: 1
The given string { more } is in the page number: 2

Here the string "more" is found on both page 1 and page 2, so the program prints both page numbers.
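
Newer PyPDF2 releases rename the calls used in this article: PdfReader replaces PdfFileReader(), the reader's pages sequence replaces getNumPages() and getPage(), and extract_text() replaces extractText(). The following is a minimal sketch of the same search under those names. The function names find_pages and find_pages_in_file are hypothetical, and a plain substring test is used here instead of re.search().

```python
def find_pages(reader, needle):
    """Return the 1-based numbers of the pages whose text contains needle."""
    found = []
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if no text could be extracted
        text = page.extract_text() or ""
        if needle in text:
            found.append(number)
    return found


def find_pages_in_file(path, needle):
    """Open the PDF at path with PdfReader and search every page."""
    # Imported here so that find_pages() can be used without PyPDF2 installed
    from PyPDF2 import PdfReader
    return find_pages(PdfReader(path), needle)
```

With a suitable PyPDF2 release installed, calling find_pages_in_file("sample.pdf", "more") returns a list of page numbers instead of printing them, which makes the result easier to reuse in other code.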