What is PDF?
PDFs are a popular format for distributing text. PDF is an abbreviation for Portable Document Format, and it utilizes the .pdf file extension. Adobe Systems designed it in the early 1990s.
Reading PDF documents in Python can assist you in automating a wide range of operations.
PyPDF2 module:
PyPDF2 is a third-party Python library. It is not part of the standard library, so it has to be installed separately. Among other things, it can report the number of pages in a PDF file.
The PyPDF2 module can be used to accomplish what we want here (text extraction), but it can also do more. The library can also generate, decrypt, and merge PDF files.
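Before we work with the PyPDF2 module, we must first install it. Note that the function names used in this article (PdfFileReader(), getNumPages(), getPage(), extractText()) belong to PyPDF2 releases before 3.0. Version 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[n], and extract_text(), so to run the code exactly as written, install an older release (for example, pip install "PyPDF2<3.0").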
Installation:
pip install PyPDF2
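A quick way to confirm that the installation worked is to ask Python whether it can locate the module. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the named module."""
    return importlib.util.find_spec(module_name) is not None


# Report whether PyPDF2 can be imported in the current environment.
if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; run: pip install PyPDF2")
```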
Finding the Page Number of a Text from a PDF File in Python
Method #1: Using Built-in Functions (Static Input)
Approach:
- Import the PyPDF2 module using the import keyword.
- Import the re module using the import keyword.
- Open the PDF file by passing its path to the PdfFileReader() function of the PyPDF2 module, and store the resulting reader object in a variable.
- Call getNumPages() on the reader object to get the total number of pages, and store it in another variable.
- Give the string as static input and store it in a variable.
- Loop over the page indices, from 0 up to the last page, using a for loop.
- Pass the page index to the getPage() function of the reader object to get the corresponding page object.
- Call extractText() on the page object to get the text of the page.
- Pass the given string and the page text to the search() function of the re module, which searches for the string in the page text.
- If a match is found, print the corresponding page number (the index plus 1, because indices start at 0).
- Exit the program.
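The search step of this approach can be tried on its own, without a PDF file. The sketch below is an illustration under stated assumptions: the function name pages_containing and the sample page texts are invented here, and the list of strings stands in for the text that extractText() would return for each page.

```python
import re


def pages_containing(page_texts, pattern):
    """Return the 1-based page numbers whose text matches the pattern."""
    found = []
    # enumerate(..., start=1) yields human-friendly page numbers directly.
    for page_number, text in enumerate(page_texts, start=1):
        if re.search(pattern, text):
            found.append(page_number)
    return found


# Stand-in page texts; in the real program each would come from extractText().
sample_pages = [
    "We went out early.",
    "She was watching the birds.",
    "The end.",
]
print(pages_containing(sample_pages, "watching"))  # prints [2]
```

As in the approach above, the string is handed to re.search() as a regular expression, so characters such as . or + keep their special meaning.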
Below is the implementation:
    # Import the PyPDF2 module using the import keyword
    import PyPDF2
    # Import the re module using the import keyword
    import re

    # Open the PDF file by passing its path to the PdfFileReader() function
    # of the PyPDF2 module and store the reader object in a variable.
    pdf_object = PyPDF2.PdfFileReader(r"sample.pdf")
    # Call getNumPages() on the reader object to get the total number of pages
    # and store it in another variable.
    pagenumbers = pdf_object.getNumPages()
    # Give the string as static input and store it in a variable.
    gvn_str = "watching"
    # Loop over the page indices of the pdf using the for loop.
    for pageno in range(0, pagenumbers):
        # Pass the page index to the getPage() function of the reader object
        # to get the page object of the corresponding page.
        Page_object = pdf_object.getPage(pageno)
        # Call extractText() on the page object to get the text of the page.
        rslt_text = Page_object.extractText()
        # Pass the given string and the page text to the search() function of
        # the re module, which searches for the string in the page text.
        if re.search(gvn_str, rslt_text):
            # If a match is found, print the corresponding page number.
            print("The given string {", gvn_str, "} is in the page number: " + str(pageno + 1))
Output:
The given string { watching } is in the page number: 2
Method #2: Using Built-in Functions (User Input)
Approach:
- Import the PyPDF2 module using the import keyword.
- Import the re module using the import keyword.
- Open the PDF file by passing its path to the PdfFileReader() function of the PyPDF2 module, and store the resulting reader object in a variable.
- Call getNumPages() on the reader object to get the total number of pages, and store it in another variable.
- Give the string as user input using the input() function and store it in a variable.
- Loop over the page indices, from 0 up to the last page, using a for loop.
- Pass the page index to the getPage() function of the reader object to get the corresponding page object.
- Call extractText() on the page object to get the text of the page.
- Pass the given string and the page text to the search() function of the re module, which searches for the string in the page text.
- If a match is found, print the corresponding page number (the index plus 1, because indices start at 0).
- Exit the program.
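Because re.search() treats its first argument as a regular expression, user input containing characters such as + or ( can either raise an error or match something unintended. One possible way to handle this, shown here as a sketch and not as part of the original program, is to escape the input before searching. The helper name literal_pattern and its ignore_case parameter are assumptions made for this example.

```python
import re


def literal_pattern(user_text, ignore_case=True):
    """Compile user text into a pattern that matches it literally."""
    cleaned = user_text.strip()
    if not cleaned:
        raise ValueError("search text must not be empty")
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape() neutralizes regex metacharacters such as + and (.
    return re.compile(re.escape(cleaned), flags)


pattern = literal_pattern(" C++ ")
print(bool(pattern.search("Notes on c++ templates")))  # prints True
```

The compiled pattern could then replace the raw string in the search step, for example pattern.search(rslt_text).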
Below is the implementation:
    # Import the PyPDF2 module using the import keyword
    import PyPDF2
    # Import the re module using the import keyword
    import re

    # Open the PDF file by passing its path to the PdfFileReader() function
    # of the PyPDF2 module and store the reader object in a variable.
    pdf_object = PyPDF2.PdfFileReader(r"sample.pdf")
    # Call getNumPages() on the reader object to get the total number of pages
    # and store it in another variable.
    pagenumbers = pdf_object.getNumPages()
    # Give the string as user input using the input() function and store it in a variable.
    gvn_str = input("Enter some random string = ")
    # Loop over the page indices of the pdf using the for loop.
    for pageno in range(0, pagenumbers):
        # Pass the page index to the getPage() function of the reader object
        # to get the page object of the corresponding page.
        Page_object = pdf_object.getPage(pageno)
        # Call extractText() on the page object to get the text of the page.
        rslt_text = Page_object.extractText()
        # Pass the given string and the page text to the search() function of
        # the re module, which searches for the string in the page text.
        if re.search(gvn_str, rslt_text):
            # If a match is found, print the corresponding page number.
            print("The given string {", gvn_str, "} is in the page number: " + str(pageno + 1))
Output:
Enter some random string = more
The given string { more } is in the page number: 1
The given string { more } is in the page number: 2
Here the string "more" is found on both page 1 and page 2, so both page numbers are printed.
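For readers on PyPDF2 3.x, where the reader class is PdfReader and pages expose extract_text(), the same idea might be written as follows. This is a sketch only: the function name find_pages and the FakePage and FakeReader classes are hypothetical stand-ins invented so the example runs without a PDF on disk. The function relies only on a .pages sequence and a per-page extract_text() method.

```python
import re


def find_pages(reader, text):
    """Return 1-based page numbers whose extracted text contains `text`.

    `reader` is any object exposing `.pages`, where each page has an
    `.extract_text()` method, as PyPDF2 3.x's PdfReader does.
    """
    pattern = re.compile(re.escape(text))  # match the text literally
    matches = []
    for number, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text() or ""  # treat a missing result as empty text
        if pattern.search(page_text):
            matches.append(number)
    return matches


# Hypothetical stand-ins so the sketch runs without a real PDF file.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


demo_reader = FakeReader(["one more time", "more again", "done"])
print(find_pages(demo_reader, "more"))  # prints [1, 2]

# With PyPDF2 3.x installed, the call would look like:
#   from PyPDF2 import PdfReader
#   find_pages(PdfReader("sample.pdf"), "more")
```

Returning a list of page numbers, instead of printing inside the loop, makes the result easy to reuse elsewhere in a script.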