How to extract only specific text from PDF file using python

How to extract some of the specific text only from PDF files using python and store the output data into particular columns of Excel. Here is the sample input PDF file (File.pdf) Link to the full PDF file File.pdf We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file. Script i have used so far:

 from io import StringIO from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.pdfpage import PDFPage from pdfminer.pdfparser import PDFParser output_string = StringIO() with open('file.pdf', 'rb') as in_file: parser = PDFParser(in_file) doc = PDFDocument(parser) rsrcmgr = PDFResourceManager() device = TextConverter(rsrcmgr, output_string, laparams=LAParams()) interpreter = PDFPageInterpreter(rsrcmgr, device) for page in PDFPage.create_pages(doc): interpreter.process_page(page) print(output_string.getvalue())

But not getting the specific output value from the PDF file . asked Sep 30, 2020 at 16:55 595 7 7 silver badges 32 32 bronze badges can you provide en example pdf file (as pdf, not as picture) Commented Sep 30, 2020 at 17:59 @D-E-N - Please find the Link updated in the question only. Commented Oct 1, 2020 at 3:14

2 Answers 2

If you want to find the data in in your way (pdfminer), you can search for a pattern to extract the data like the following (new is the regex at the end, based on your given data):

from io import StringIO import re from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.pdfpage import PDFPage from pdfminer.pdfparser import PDFParser output_string = StringIO() with open('testfile.pdf', 'rb') as in_file: parser = PDFParser(in_file) doc = PDFDocument(parser) rsrcmgr = PDFResourceManager() device = TextConverter(rsrcmgr, output_string, laparams=LAParams()) interpreter = PDFPageInterpreter(rsrcmgr, device) for page in PDFPage.create_pages(doc): interpreter.process_page(page) finding = re.search(r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+", output_string.getvalue()) invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n") print(invoice_no, order_no, due_date, total_due)

If you want to store the data in excel, you may have to be more specific (or open a new question) or look on these pages:

PS: the other answer looks like a good solution, you only have to filter the data

EDIT: Second solution. Here I use another package PyPDF2 , because there you get the data in an other order (maybe this is possible with PDFMiner, too). If the text before the values are always the same, you can find the data like this:

import re import PyPDF2 def parse_pdf() -> list: with open("testfile.pdf", "rb") as file: fr = PyPDF2.PdfFileReader(file) data = fr.getPage(0).extractText() regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)") regex_order_no = re.compile(r"Order Number(\d+)") regex_invoice_date = re.compile(r"Invoice Date(\S+ \d, \d)") regex_due_date = re.compile(r"Due Date(\S+ \d, \d)") regex_total_due = re.compile(r"Total Due(\$\d+\.\d)") invoice_no = re.search(regex_invoice_no, data).group(1) order_no = re.search(regex_order_no, data).group(1) invoice_date = re.search(regex_invoice_date, data).group(1) due_date = re.search(regex_due_date, data).group(1) total_due = re.search(regex_total_due, data).group(1) return [invoice_no, due_date, total_due] if __name__ == '__main__': print(parse_pdf())

Maybe you have to change the regexes, because they are only based on the given example. The regexes are only working if they find the regex, so you have to work with try: except per regex ;)
If this does not answer your question, you have to provide more information/example pdfs.