Extract Text from any PDF File in Python 3.10 Tutorial

Today we will be learning how we can extract the text from PDF files in Python 3.10, so that we can later process that text in any way we please.
▶ Become job-ready with Python:
www.indently.io
▶ Follow me on Instagram:
/ indentlyreels

Published: 8 Oct 2024

Comments: 39

@tobiwie 1 year ago

In some of the latest updates to PyPDF2 the class "PdfFileReader" got replaced with "PdfReader". Code still works fine with "PdfReader". :)
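
The rename described above can be sketched roughly as follows. This is a minimal example that assumes a recent PyPDF2 release is installed; `collect_page_text` and `extract_text_from_pdf` are illustrative names, not part of the library.

```python
from typing import Iterable, List


def collect_page_text(pages: Iterable) -> List[str]:
    """Return the text of each page-like object (anything with extract_text())."""
    # Pages without a text layer may yield an empty result; normalise it to "".
    return [page.extract_text() or "" for page in pages]


def extract_text_from_pdf(pdf_path: str) -> List[str]:
    """Read a PDF with PdfReader, the newer name for PdfFileReader."""
    # Imported lazily so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return collect_page_text(reader.pages)
```

Calling `extract_text_from_pdf('sample.pdf')` would then return one string per page; the file name here is only a placeholder.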

@frapsg2 6 months ago

Awesome, so helpful! That's much simpler and more ready-to-use than all the other approaches found online. Is there a way to export the extracted text to a CSV or XLSX file?
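
One possible way to export to CSV uses only the standard library. The sketch below assumes the extracted text is already available as a list with one string per page; `write_pages_csv` is a hypothetical helper name.

```python
import csv
import io
from typing import Iterable, TextIO


def write_pages_csv(page_texts: Iterable[str], out: TextIO) -> None:
    """Write one CSV row per page: the 1-based page number and its text."""
    writer = csv.writer(out)
    writer.writerow(["page", "text"])
    for number, text in enumerate(page_texts, start=1):
        writer.writerow([number, text])


# Demo with an in-memory buffer. For a real file, pass
# open("output.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_pages_csv(["First page text", "Second page text"], buffer)
print(buffer.getvalue())
```

Writing XLSX would need a third-party library such as openpyxl, which is not shown here.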

@vitaliibaglaiev4147 5 months ago

Just amazing explanation, short and sweet!

@vishnumuralidhar5659 11 months ago

Thanks for the awesome tutorial. Please do a video on two-sided PDFs, which isn't available on YouTube 🙃

@akashnath7999 2 years ago

It's so helpful...loved it ❤

@Indently 2 years ago

Glad it helped! :)

@Mike_elGreco 7 months ago

It worked! Thank you !!

@albeeshi 10 months ago

How do I extract data from more than one PDF file and put it in a table?

@abigailmapuladikobo9941 5 months ago

Got an answer?
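
One way to approach the multi-file question is to collect the page texts per file and flatten them into rows. The sketch below assumes PyPDF2 is installed and that all PDFs sit in one folder; `build_table` and `read_folder` are hypothetical names.

```python
from pathlib import Path
from typing import Dict, List


def build_table(documents: Dict[str, List[str]]) -> List[dict]:
    """Flatten {filename: [page texts]} into one row per page."""
    rows = []
    for filename, pages in documents.items():
        for number, text in enumerate(pages, start=1):
            rows.append({"file": filename, "page": number, "text": text})
    return rows


def read_folder(folder: str) -> Dict[str, List[str]]:
    """Extract page texts from every .pdf in a folder (needs PyPDF2)."""
    from PyPDF2 import PdfReader  # lazy import; assumed to be installed

    documents = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        documents[pdf_path.name] = [page.extract_text() or "" for page in reader.pages]
    return documents
```

The resulting rows can be written with `csv.DictWriter` or loaded into a spreadsheet tool of choice.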

@kevinmakumbe 8 months ago

Nice tutorial. How can I get the coordinates of the text in my PDF file?
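
For coordinates, one option is the `visitor_text` callback that newer PyPDF2 releases accept in `extract_text()`. The sketch below assumes such a release is installed; `make_position_collector` and `text_positions` are hypothetical names, and the positions come straight from the text matrix, so they may need further transformation for rotated or scaled content.

```python
from typing import Callable, List, Tuple

Position = Tuple[float, float, str]


def make_position_collector() -> Tuple[List[Position], Callable]:
    """Return a list plus a visitor function that appends (x, y, text) to it."""
    found: List[Position] = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; its last two entries are the x and y offsets.
        if text.strip():
            found.append((tm[4], tm[5], text))

    return found, visitor


def text_positions(pdf_path: str, page_index: int = 0) -> List[Position]:
    """Collect text fragments with their positions from one page."""
    from PyPDF2 import PdfReader  # lazy import; needs a release with visitor_text

    found, visitor = make_position_collector()
    PdfReader(pdf_path).pages[page_index].extract_text(visitor_text=visitor)
    return found
```

pdfplumber, mentioned elsewhere in this thread, offers word-level bounding boxes as an alternative.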

@davet4335 1 year ago

The code did not work for me on a Windows 11 PC. I kept having ChatGPT analyze the code and error messages, and after many tries it fixed it:

    import PyPDF2


    def extract_text_from_pdf(pdf_file: str) -> list[str]:
        # Open the PDF file of your choice
        with open(pdf_file, 'rb') as pdf:
            reader = PyPDF2.PdfReader(pdf)
            pdf_text = []
            # Collect the text of every page in order
            for page in reader.pages:
                content = page.extract_text()
                pdf_text.append(content)
        return pdf_text


    def main():
        extracted_text = extract_text_from_pdf('sample.pdf')
        for text in extracted_text:
            print(text)


    if __name__ == '__main__':
        main()

@Absolute_gamerz 7 months ago

Thanks !

@milans2373 7 months ago

Thank you so fucking much i got crazy over this

@talhafaiz3597 2 months ago

Thanks a lot mate!

@オタヴィオルイス 1 year ago

helped me a lot. Thanks

@gvenagas 4 months ago

I found that by opening a PDF file with Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the web browser has converted it to HTML, and maybe save it for further processing with some programming language.

@mohammedasimsameer1220 8 months ago

Thank you bro

@boukefmohamed3191 6 months ago

Excellent

@MedoHamdani 5 months ago

Will it work with the Arabic language, and will it be able to extract handwritten manuscripts?

@Miyazaki97 1 year ago

Thank you for the awesome tutorial. I have a question about extracting articles, and I hope you can help me. When extracting articles and reports, there are many references, table legends, and titles that are not required. Would it be possible to remove all those references and table contents, including legends and titles, when extracting the PDF file?
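
A rough heuristic for this kind of clean-up is to cut the text at a references heading and to drop lines that look like captions. The sketch below is only that: the patterns and the function name `strip_references_and_captions` are assumptions, real documents vary, and table bodies themselves cannot be recognised this way.

```python
import re
from typing import List

# Assumed patterns; adjust them to the documents being processed.
REFERENCES_HEADING = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE)
CAPTION_LINE = re.compile(r"^\s*(table|figure|fig\.)\s*\d+", re.IGNORECASE)


def strip_references_and_captions(text: str) -> str:
    """Drop caption-like lines and everything from the references heading on."""
    kept: List[str] = []
    for line in text.splitlines():
        if REFERENCES_HEADING.match(line):
            break  # discard the heading and all lines after it
        if CAPTION_LINE.match(line):
            continue  # discard lines such as "Table 1: ..." or "Figure 2. ..."
        kept.append(line)
    return "\n".join(kept)
```

Note that a body sentence starting with "Table 2 shows" would also be dropped, so the output is worth checking by hand.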

@mehdismaeili3743 1 year ago

great as always.

@valmirrastelyjunior9400 9 months ago

Great

@rishikeshchava6895 5 months ago

Hey, I have some 600 files with a large volume of data, and text extraction using PyPDF2 is taking a lot of time. Is there any other way to do this?
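
One option for many files is to spread the work over several processes, since each file can be parsed independently. The sketch below assumes PyPDF2 is installed; `extract_one` and `extract_many` are hypothetical names, and the actual speed-up depends on the machine and the files.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_one(path: str) -> str:
    """Extract all text from a single PDF (runs inside a worker)."""
    from PyPDF2 import PdfReader  # lazy import; assumed to be installed

    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_many(
    paths: Iterable[str],
    worker: Callable[[str], str] = extract_one,
    executor_factory: Callable = ProcessPoolExecutor,
) -> Dict[str, str]:
    """Run the worker over all paths in parallel and map path -> result."""
    path_list = list(paths)
    with executor_factory() as pool:
        # pool.map keeps results in the same order as the inputs.
        return dict(zip(path_list, pool.map(worker, path_list)))


# ThreadPoolExecutor can be passed as executor_factory for quick experiments,
# although threads are unlikely to speed up CPU-bound parsing.
```

With the default process pool, the call to `extract_many` should sit under `if __name__ == '__main__':`, which matters on Windows in particular.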

@gulfamhussain9674 2 months ago

Do you have any solution for PDFs with characters? When I try to apply this solution to those PDFs, it prints gibberish characters.

@rs-nm7hp 2 years ago

U r awesome 👏

@Indently 2 years ago

Thanks! :)

@jvwee 1 year ago

I am pretty sure there are over a thousand instances of the word "coffee" in the PDF. However, this seems to have counted only the number of pages on which the word appeared.
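
If the count comes from a per-page membership test such as `word in page_text` (an assumption about the code being discussed), it will count pages rather than occurrences. A minimal sketch that reports both numbers, with `count_word` as a hypothetical helper name:

```python
import re
from typing import List, Tuple


def count_word(page_texts: List[str], word: str) -> Tuple[int, int]:
    """Return (total occurrences, number of pages containing the word)."""
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    per_page = [len(pattern.findall(text)) for text in page_texts]
    total = sum(per_page)
    pages_with_word = sum(1 for count in per_page if count > 0)
    return total, pages_with_word


print(count_word(["Coffee and coffee", "tea", "coffee"], "coffee"))  # (3, 2)
```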

@Sathishedutech 1 year ago

Hi sir, does it work with local languages like Telugu?

@zainsaqib3702 1 year ago

I keep getting SyntaxError: unmatched ')' on line 4. I'm running Python 3.9; could that be the cause?

@atharkhalid3275 1 year ago

What if we want to extract the text of one particular page?
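
For a single page, the reader's `pages` sequence can be indexed directly. The sketch below assumes PyPDF2 is installed and uses 1-based page numbers for convenience; `text_of_page` and `read_page` are hypothetical names.

```python
from typing import Sequence


def text_of_page(pages: Sequence, page_number: int) -> str:
    """Return the text of one page; page_number is 1-based."""
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page_number must be between 1 and {len(pages)}")
    # Sequences are 0-based, so page 1 lives at index 0.
    return pages[page_number - 1].extract_text() or ""


def read_page(pdf_path: str, page_number: int) -> str:
    """Open a PDF and extract the text of a single page (needs PyPDF2)."""
    from PyPDF2 import PdfReader  # lazy import; assumed to be installed

    return text_of_page(PdfReader(pdf_path).pages, page_number)
```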

@louis19449 9 months ago

How do you add the PDF file to the project?

@gianlucagiannetto5146 3 months ago

I wrote the code line by line, word for word, but it continues to give me File not found. How is that possible? P.S. I managed to extract the text; the only problem is the layout of the output: I get a string that is miles long.
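
A "File not found" error often means the script's working directory is not the folder that holds the PDF. The sketch below shows one way to look in several candidate folders and to wrap a very long string for display; `find_pdf` and `wrap_text` are hypothetical helper names.

```python
import textwrap
from pathlib import Path
from typing import Iterable, Optional


def find_pdf(filename: str, search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing match for filename, or None if there is none."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate.resolve()
    return None


def wrap_text(text: str, width: int = 80) -> str:
    """Re-flow one long string into lines of at most `width` characters."""
    return textwrap.fill(text, width=width)


# Typical search order: the working directory, then the script's own folder.
# candidates = [Path.cwd(), Path(__file__).parent]
```

Printing `Path.cwd()` at the top of the script is a quick way to see where Python is actually looking.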

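@enkvadrat_ 2 months ago

    import pdfplumber


    def convert_pdf_to_text(pdf_path):
        page_texts = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # layout=True tries to preserve the visual layout of the page
                text = page.extract_text(layout=True)
                print(text)
                page_texts.append(text)
        # Return the text of every page, not only the last one
        return "\n".join(page_texts)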

@MedoHamdani 4 months ago

So this is not OCR.
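
Indeed, libraries like PyPDF2 read the text layer stored in the PDF and do not recognise characters in images, so scanned or handwritten pages usually yield little or no text. A simple check for such pages might look like the sketch below; the threshold and the name `pages_needing_ocr` are assumptions.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is suspiciously short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


print(pages_needing_ocr(["A full paragraph of ordinary text.", "", "  \n"]))  # [2, 3]
```

Pages flagged this way would need a separate OCR tool before any text processing.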