Pdf Data Extraction Using Python | Pypdf2 Extract PDF Data to Excel | Extract Text From PDF to Excel

Python2020

20K views

#pythonpdfautomation #pdfdataextraction #pypdf2 #extractdatafrompdftoexcel
00:00 Intro Extract Data From PDF using Python
02:00 Python PDF Data Extraction from PDF to Excel
03:30 Python use Regular Expression to get data from PDF
05:10 PYPDF2 extract data from pdf to excel
06:30 Python Openpyxl write pdf data to excel
FOR ALL OTHER VIDEOS VISIT BELOW LINK
/ python2020
As we know, PDF is one of the most common file types used for digital communication.
Many kinds of documents are created as PDFs, such as invoices, receipts, policies, certificates and others.
Python has multiple libraries for working with PDF files.
In this video we will see how to extract data from PDF files using Python and PyPDF2.
We will extract data from PDFs in different formats into an Excel file.
We will use Python regular expressions to pick the relevant data out of the extracted text.
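
As a rough illustration of the regular-expression step, the sketch below pulls two fields out of text that has already been extracted from a PDF page. The sample text, the field labels ("Invoice No", "Total") and the helper name find_field are made up for illustration; real documents will need their own patterns.

```python
import re

# Hypothetical text, standing in for what a PDF page might yield after extraction.
SAMPLE_TEXT = """Invoice No: INV-1042
Date: 12/07/2024
Total: 1,250.00"""


def find_field(label, text):
    """Return the value that follows 'label:' on the same line, or None."""
    match = re.search(rf"^{re.escape(label)}\s*:\s*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


invoice_no = find_field("Invoice No", SAMPLE_TEXT)
total = find_field("Total", SAMPLE_TEXT)
print(invoice_no, total)
```

Returning None for a missing label makes it easy to leave a cell empty when one PDF format lacks a field that another format has.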

Published: 12 Jul 2024

Comments: 65

@danilsagidullin8116 1 month ago

Thank you so much! I nearly broke my brain trying to solve this kind of task.

@Python2020 1 month ago

☺

@ideationtosuccess5439 2 months ago

Fantastic. Exactly what I was looking for. Thanks buddy! One quick question: I see you are using 3 PDF files for this demo, named file1, file2 and file3. Do the PDF names need to be in some sequential order for the loop to iterate over each file, or can the file names be anything?

@Python2020 2 months ago

No, that is just an example. You can loop over the files in the folder any way you like, and if you want you can sort the files by different properties.
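
To make the reply above concrete, here is a minimal sketch that picks up every PDF in a folder regardless of its name and optionally sorts by modification time. The helper name list_pdfs and the demo file names are assumptions for illustration, not code from the video.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder, by_modified_time=False):
    """Return every .pdf file in 'folder', whatever it is named."""
    pdfs = [p for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    if by_modified_time:
        # Oldest first, based on the file's last-modified timestamp.
        return sorted(pdfs, key=lambda p: p.stat().st_mtime)
    # Default: alphabetical by file name, ignoring case.
    return sorted(pdfs, key=lambda p: p.name.lower())


# Small demo with empty, made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("report.pdf", "Invoice_7.PDF", "notes.txt"):
        (Path(demo_dir) / name).touch()
    found = [p.name for p in list_pdfs(demo_dir)]
print(found)
```

The non-PDF file is skipped, so stray files in the folder do not reach the PDF reader.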

@elitebrightfuture4405 1 year ago

Please make a video on PyPDF2 installation, because I have an issue: it downloads as a file, like a Word or Internet Explorer file.

@yashchavan7880 1 year ago

Can you please share the code used in this video?

@miladmirzaei1762 1 year ago

Hello, I have an error:
Traceback (most recent call last):
  File "C:\Users\milad\PycharmProjects\pythonProject5\main.py", line 6, in <module>
    for file_name in os.listdir('all_format_pdf'):
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'all_format_pdf'
How can I fix it?

@Python2020 1 year ago

Give the complete path.
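
A relative name such as 'all_format_pdf' is looked up relative to the current working directory, which in PyCharm or Jupyter is not always the folder that holds the script. The sketch below, with a hypothetical helper named resolve_folder, turns the name into an absolute path and fails early with a message that shows where Python actually looked.

```python
from pathlib import Path


def resolve_folder(folder, base=None):
    """Turn 'folder' into an absolute path and fail early with a clear message."""
    base = Path(base) if base is not None else Path.cwd()
    # If 'folder' is already absolute, pathlib ignores 'base'.
    path = (base / folder).resolve()
    if not path.is_dir():
        raise FileNotFoundError(
            f"Folder not found: {path} (current working directory: {Path.cwd()})"
        )
    return path


# The current directory always exists, so this call succeeds.
here = resolve_folder(".")
print(here)
```

Passing base=Path(__file__).parent from inside a script would resolve the folder next to the script rather than next to wherever the IDE started Python.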

@rajasekharreddy.g3952 2 years ago

Do we need to install any prerequisites first? Please share how.

@Python2020 2 years ago

Yes, we have to. Check video no. 95 for installing Python libraries.

@1stlookdigitalmedia 1 year ago

Sir, I have one question, please answer. I have PDFs in regional languages like Hindi and Marathi, and the fonts in the PDFs are not Unicode. How can I extract the data from the PDF into Excel? I need that PDF data as Unicode in Excel. Example: PDF files like voter lists in regional languages. Please answer, as I keep trying but nothing has worked so far. Thanks in advance.

@Python2020 1 year ago

You can install fonts for the local languages and use them for the text. You first need to get proper text output; after that, the regex concept remains the same.

@dilkashgazala831 1 year ago

Hi, can you please tell me whether it is possible to extract tables with similar structures from different PDFs into an Excel sheet using Python?

@Python2020 1 year ago

You can share a sample at hiteshb0101@gmail.com

@dilkashgazala831 1 year ago

@Python2020 Hi, there is some confidential data that I can't share; however, I can describe my problem. Suppose I have three PDFs containing student details in a tabular format with the same schema, but each PDF is from a different institute. I want to extract the data in the tables from all three PDFs into one Excel sheet using Python. Kindly help me.

@Python2020 1 year ago

OK, in this video I am already storing data in variables. You have to detect the start and end of the table. Next is how to write the data into cells; for that, watch video no. 5.

@Flixrin 2 years ago

When running the code at 2:39 I got a syntax error for "for file_name". Any idea why? I'm using Jupyter Notebook.

@Python2020 2 years ago

Share the full line of your code.

@Flixrin 2 years ago

@Python2020 Hi, never mind. The error was because I had not installed the PyPDF2 module.

@FMP_Media 2 years ago

I have another question, please: how can I extract all the PDF pages, not only the first page? Some of my PDF files have around 14 pages, so how should I extract all of them?

@Python2020 2 years ago

There is a line where I have used zero; there you have to add a loop after getting the page count.

@FMP_Media 2 years ago

@Python2020 Thank you, it works 👍🏻❤️
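
A minimal sketch of the page loop discussed in this thread, assuming a PyPDF2 version that provides PdfReader, pages and extract_text (older releases spell these PdfFileReader, getPage and extractText, as in another comment on this page). The function names are made up for illustration.

```python
def extract_all_pages(reader):
    """Join the text of every page in an already-opened PDF reader."""
    page_texts = []
    for page in reader.pages:  # every page, instead of only reader.pages[0]
        # Guard against pages that yield no text, e.g. image-only pages.
        page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)


def read_pdf_text(pdf_path):
    """Open 'pdf_path' with PyPDF2 and return the text of all its pages."""
    import PyPDF2  # third-party; imported here so the helper above has no dependencies

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return extract_all_pages(reader)
```

Keeping the loop in its own function means the same code works whether the pages come from one file or from many files processed in turn.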

@elitebrightfuture4405 1 year ago

How did you install PyPDF2 on your PC?

@Python2020 1 year ago

Not on the PC, but in a Python environment. Watch video 95 on my channel: how to create a new Python environment.

@md-mohammed8455 1 year ago

Sir, in my case I have scanned PDF files, and I want to copy only a specific text from each page. Text sample: xxxxx-xxx-xxx-IR-xxxxx. I only know that IR is fixed on each page; I don't know the text before and after IR, but I want to copy the whole thing from each page. Please help. If you have a VBA macro, please share it, or any other tool. Thanks.

@Python2020 1 year ago

There is another video on my channel: Get Text from Scan pdf.

@md-mohammed8455 1 year ago

@Python2020 Sir, for specific text also?

@md-mohammed8455 1 year ago

@Python2020 Sir, I searched but could not find it; please share the link.

@Python2020 1 year ago

Here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Eg5pkNpYdmE.html
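
Once the scanned page has been turned into text (the OCR step is covered by the linked video, not here), the "specific text" part of the question can be handled with a regular expression. The sketch below assumes the code consists only of letters, digits and hyphens around a fixed "-IR-"; the sample text and names are invented for illustration.

```python
import re

# A run of letters, digits or hyphens that has "-IR-" somewhere in the middle.
IR_CODE = re.compile(r"[\w-]+-IR-[\w-]+")


def find_ir_codes(page_text):
    """Return every code on the page that contains '-IR-'."""
    return IR_CODE.findall(page_text)


# Made-up page text for illustration only.
sample = "Drawing ref AB123-45X-001-IR-00917 rev B\nSee also ZZ-9-IR-7."
codes = find_ir_codes(sample)
print(codes)
```

If the real codes can contain other characters, such as slashes or dots, the character class would need to be widened accordingly.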

@kiraningale_ 2 years ago

Sir, I want to extract the text fields and also the table below them. Can we do that?

@Python2020 2 years ago

For tables there is a different approach. You have to identify some keyword at the start of the table and at the end of the table, and run the loop in between.

@kiraningale_ 2 years ago

@Python2020 I need a specific column or a specific row, as the user wants.

@Python2020 2 years ago

Use a counter inside the loop, or you can skip a column by identifying its text. It's complex logic, I know.
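
The start-keyword and end-keyword idea from this thread could look roughly like the sketch below. The marker strings, the sample text and both function names are assumptions for illustration; it also assumes spaces were not stripped from the extracted text and that no cell contains a space.

```python
def rows_between(lines, start_marker, end_marker):
    """Return the lines strictly between the first start marker and the next end marker."""
    rows, inside = [], False
    for line in lines:
        if not inside:
            if start_marker in line:
                inside = True
            continue
        if end_marker in line:
            break
        if line.strip():
            rows.append(line)
    return rows


def pick_column(rows, index):
    """Split each row on whitespace and keep one column (0-based)."""
    picked = []
    for row in rows:
        cells = row.split()
        if index < len(cells):
            picked.append(cells[index])
    return picked


# Made-up page text; "Roll Name Marks" and "Total students" are assumed markers.
page_text = """Student report
Roll Name Marks
1 Asha 82
2 Ravi 75
Total students 2
Printed by office"""

table_rows = rows_between(page_text.splitlines(), "Roll Name Marks", "Total students")
marks = pick_column(table_rows, 2)
print(table_rows, marks)
```

Picking a specific row is then ordinary list indexing on table_rows, and each row can be appended to a worksheet as a list of cells.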

@kibtiachowdhury6011 1 year ago

Sir, I want to remove the header and footer from every page. Could you please help me?

@Python2020 1 year ago

You can create a new PDF: first extract the data from the existing one, then use that data to write the new PDF. Refer to video no. 12 on my channel.

@ritwikmishra4841 1 year ago

Does anyone know how to convert Excel sheets into PDF format using Python?

@Python2020 1 year ago

On my channel there is macro code that may work for you.

@lepdenlkr2427 1 year ago

Are you on Fiverr?

@Python2020 1 year ago

@gourav0934 1 year ago

I am getting an error, "zipfile.BadZipFile: File is not a zip file". Can you please help me?

@Python2020 1 year ago

Your file might be a scanned one. Copy the text manually and see whether you are able to select it and paste it into Notepad.

@rajasekharreddy.g3952 2 years ago

I am getting this error when I try to run the script: 'tuple' object has no attribute 'seek'

@Python2020 2 years ago

Check the last line of the error and paste the last 2 lines into Google, or post the line of code which is causing the error.

@rajasekharreddy.g3952 2 years ago

@Python2020 This is the error I got:
line 7, in <module>
read_pdf = PyPDF2.PdfFileReader(load_pdf)
in read
stream.seek(-1, 2)
AttributeError: 'tuple' object has no attribute 'seek'

@Python2020 2 years ago

@rajasekharreddy.g3952 Send me your file at hiteshb0101@gmail.com

@rajasekharreddy.g3952 2 years ago

@Python2020 I sent the file to your email.

@Python2020 2 years ago

Can you pass the full folder path in the for loop and try? I just checked your mail on mobile.

@hungsingtsoi9078 1 year ago

Hello, I have multiple PDFs in the folder. It does show all the PDF names when I print(file_name); however, it only prints the content of the first PDF rather than all of them. Would you please take a look? Thanks a lot. My code:
for file_name in os.listdir('BAML'): print(file_name) #Loop on Files load_pdf = open(r'M:\Public Trade Operation\\Middle Office Package\\2023\\01 January\\Trade Confirmation\\US trades\\BAML\\'+file_name, 'rb') read_pdf = PyPDF2.PdfReader(load_pdf) #Load All Pdf in Variable page_count = len(read_pdf.pages) #Count the pdf pages first_page = read_pdf.pages[0] #read only the first page page_content = first_page.extract_text() #extract string output page_content = page_content.replace(' ','') print(page_content)

@Python2020 1 year ago

You need to get all the pages in a list and then loop over each page. Add the code in the loop body.

@hungsingtsoi9078 1 year ago

Can you provide an example? I am new to Python and not quite familiar with it.
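
Since the indentation of the posted code was lost in the comment, the exact cause cannot be confirmed. One possible cause is that the lines after print(file_name) were not indented under the for loop, so they ran only once. The sketch below keeps all per-file work inside the loop body; extract_folder and first_page_text are hypothetical names, and the reading step mirrors the calls in the posted code.

```python
import os


def extract_folder(folder, read_text):
    """Call read_text(path) once per PDF in 'folder' and collect the results."""
    results = {}
    for file_name in sorted(os.listdir(folder)):
        if not file_name.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(folder, file_name)
        # Everything that must happen for EVERY file stays indented inside the loop.
        results[file_name] = read_text(pdf_path)
    return results


def first_page_text(pdf_path):
    """Read only the first page, as in the posted code (needs PyPDF2 installed)."""
    import PyPDF2  # third-party

    with open(pdf_path, "rb") as load_pdf:
        read_pdf = PyPDF2.PdfReader(load_pdf)
        return read_pdf.pages[0].extract_text() or ""
```

A call such as extract_folder(full_path_to_BAML, first_page_text) would then return one entry per PDF, which can be printed or written to Excel after the loop has finished.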

@FMP_Media 2 years ago

Bro, I've done exactly as you did and installed the required libraries, but an error pops up on line 4 ( for file_name in os.listdir('.........') ). The error is: FileNotFoundError: [WinError 3] The system cannot find the path specified: '....' Do you have any solution that might help, please?

@Python2020 2 years ago

The code is correct; just check the following: the complete folder path is given in os.listdir, the PDF files are in that folder, and the PDFs are text-based, not scanned.

@Python2020 2 years ago

Your error relates to os.listdir. Search Google for: iterate over pdf files in a folder using python

@FMP_Media 2 years ago

@Python2020 Yeah, the code is correct for sure. I even ran another very simple script just to open a random file in Python, but I always get the same error (FileNotFoundError...), not only for PDF files but also for plain text files. I'm using PyCharm too, and I really don't know what exactly the problem is. I have thousands of PDF files, scanned and not scanned, and I need to extract the data from them and write it to Excel, but I can't do anything because of this error...

@FMP_Media 2 years ago

@Python2020 Finally I found the solution: it was a wrong path. I ran os.chdir to change the working directory to the folder containing my PDF files, and it works. Anyway, thank you for your time 👍🏻

@harishbollineni2588 2 years ago

Can you please send me the code, sir?

@Python2020 2 years ago

Hi Harish, I have explained the code in the video. If you have a doubt at any point, mention the time and the question. I don't keep the files saved. Let me know if you face any error.

@FMP_Media 2 years ago

My last question 😁 I'm trying to extract this data from the PDF files into just 2 columns in Excel: the first column is the PDF file name and the second column is the text I extracted from that PDF. I used your method from the video, but it only works for the file names in column 1, not for the extracted text in column 2. Whenever I run the code for both columns, a long error pops up: [ Traceback (most recent call last): ........... in check_string raise IllegalCharacterError openpyxl.utils.exceptions.IllegalCharacterError ] When I run the code just for the first column, it works well and writes the PDF file names to the first column in Excel. I just want to write the text I extracted from each PDF file into column 2. I don't want specific details like names, addresses and mobile numbers; I want the whole extracted PDF text in Excel. If you have any tips or a solution for that, please tell me, and thank you so much 🤗

@Python2020 2 years ago

As per the error, there is likely a string encoding issue; that is when the illegal character error comes up. You can try trimming the text or changing the encoding. For fetching a specific value you can use slicing or regex.

@FMP_Media 2 years ago

@Python2020 Alright, thank you 🙏🏻
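
openpyxl raises IllegalCharacterError when a cell value contains certain control characters, which extracted PDF text sometimes does. One way to get the whole text into column 2 is to strip those characters before writing. In the sketch below, the function names and column headings are made up, and the pattern is written out by hand to mirror the one openpyxl ships as openpyxl.cell.cell.ILLEGAL_CHARACTERS_RE.

```python
import re

# Control characters that openpyxl rejects in cell values (tab, newline and
# carriage return are allowed). Written out here so the sketch is self-contained.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_excel(text):
    """Remove control characters that make openpyxl raise IllegalCharacterError."""
    return ILLEGAL_CHARS.sub("", text)


def write_two_columns(rows, out_path):
    """Write (file name, full extracted text) pairs to an .xlsx file."""
    import openpyxl  # third-party

    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.append(["file name", "text"])
    for file_name, text in rows:
        sheet.append([file_name, clean_for_excel(text)])
    workbook.save(out_path)
```

Here rows is expected to be an iterable of (file name, text) pairs, for example the items of a dictionary built while looping over the PDF files.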

@anushyaa5442 1 year ago

Hi sir, I have a doubt. I need to extract specific text such as email - xxxxx@xxxx, phone nos - (xxx)xxx-xxxx, and name in a sheet, and convert the data into Excel or CSV. I wrote this program; please help me solve it. The code is below.
import PyPDF2 import openpyxl import re import pytesseract as tess tess.pytesseract.tesseract_cmd=r"C:/Tesseract-OCR/tesseract.exe" from PIL import Image excel=openpyxl.Workbook() sheet=excel.active sheet.title='pdf' sheet.append(['phone number']) pdfFileObj = open('C:\Program Files\Python310/filename.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) num_pages = pdfReader.numPages count = 0 text = "" #The while loop will read each page. while count < num_pages: pageObj = pdfReader.getPage(count) count +=1 text += pageObj.extractText() text=text.replace(' ','') #print(text) print('---------') """ """ phone = re.findall('\(\d{3}\)\d{3}-\d{4}', text) #print(phone) zip_code=re.findall('\d{5}',text) my_zip=set(zip_code) #print(my_zip) email=re.findall('@*?\.',text) print (email) sheet.append(['phone','zip_code']) excel.save('C:\Program Files\Python310/file.xlsx') print('DONE!!')

@Python2020 1 year ago

After reading the PDF, use regular expressions, and use variables to append the data to the CSV.

@anushyaa5442 1 year ago

@Python2020 I am not able to understand; an example, please.
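
A possible shape for the requested example, using only the standard library. Two things in the posted code stand out: the email pattern '@*?\.' matches little more than a dot, and sheet.append(['phone','zip_code']) writes those two literal words rather than the values found. The patterns, sample text and function names below are assumptions for illustration, and the email pattern is deliberately simple rather than fully standards-compliant.

```python
import csv
import io
import re

PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")        # e.g. (555) 123-4567
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")   # simple, not RFC-complete
ZIP_CODE = re.compile(r"\b\d{5}\b")                   # five digits on their own


def extract_contacts(text):
    """Return de-duplicated phones, emails and zip codes found in 'text'."""
    return {
        "phone": sorted(set(PHONE.findall(text))),
        "email": sorted(set(EMAIL.findall(text))),
        "zip_code": sorted(set(ZIP_CODE.findall(text))),
    }


def contacts_to_csv(contacts):
    """Build CSV text with one row per found value: (kind, value)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["kind", "value"])
    for kind, values in contacts.items():
        for value in values:
            writer.writerow([kind, value])
    return buffer.getvalue()


# Made-up text standing in for the extracted PDF content.
sample = "Call (555) 123-4567 or mail jane.doe@example.com, Springfield 62704."
found = extract_contacts(sample)
csv_text = contacts_to_csv(found)
print(csv_text)
```

To produce a file instead of a string, the same writer can be pointed at open(path, "w", newline="") in place of the in-memory buffer.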