Extract Text from PDF with Python

39K views

In this video we learn how to extract text from a PDF file with Python using PyPDF2, and how to convert a PDF to a text file. We start with a simple example: extracting the text from a single page. We then extract the text from every page in the PDF. Next, we pull text only from pages that meet a condition (here, pages containing the word "Waldo"), which shows how to extract text from a chosen set of pages, and we write those pages to a new PDF document. Finally, we extract only the sentences that contain "Waldo", along with the pages those sentences were located on.
This is based on a real project I did for work where I had to extract pertinent information about specific people from thousands of PDFs that contained many pages each.
►►GitHub: github.com/bvalgard/working-w...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
More of my videos you may be interested in
►►Create PDF with Python | Part 1 • Create PDF with Python...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
References
realpython.com/creating-modif...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
0:00 Intro - Where's Waldo
0:36 pip install
0:59 Extract Text
1:20 Step 1
2:09 Step 2
2:58 All Pages to txt
4:20 Where's Waldo Pages
5:51 Write to PDF
6:21 Get Text from Specific Pages
8:15 Waldo Sentences
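
The first steps listed in the description (text from one page, all pages to a text file, then the pages that mention Waldo) can be sketched roughly as below. This is not the code from the video: it assumes the `PdfReader` class and `extract_text()` method found in PyPDF2 3.x (and its successor `pypdf`), and the helper names and file names are made up for illustration.

```python
def read_page_texts(pdf_path):
    """Return one string per page. Assumes PyPDF2 3.x is installed."""
    # Imported inside the function so the helpers below work without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Scanned or image-only pages usually yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def save_as_text(page_texts, txt_path):
    """Write all pages to one text file, separated by blank lines."""
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(page_texts))


def pages_mentioning(page_texts, word):
    """Return 1-based page numbers whose text contains `word` (case-insensitive)."""
    word = word.lower()
    return [n for n, text in enumerate(page_texts, start=1) if word in text.lower()]


# Example usage (hypothetical file names):
#   texts = read_page_texts("wheres_waldo.pdf")
#   print(texts[0])                          # text of the first page
#   save_as_text(texts, "wheres_waldo.txt")  # whole PDF as a text file
#   print(pages_mentioning(texts, "Waldo"))  # pages that mention Waldo
```

Keeping the PDF reading separate from the searching means the search helpers can be tried on plain strings before any PDF is involved.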

Science

Published: 7 Aug 2024

Comments: 34

@MURALIKRISHNAhai 3 years ago

We love the work you do. You probably save someone's day every time you upload a new video. Thank you 😊 The way you include all the viewers in the appreciation after completing a task is awesome. Ohhhoooo... we did great 👍

@srikrithibharadwaj6779 2 years ago

Well explained. Thank you so much.

@TheBtrivedi 1 year ago

Amazing. Clear explanation of what's being done. Subscribed.

@shilpakale2699 2 years ago

Very nicely explained. I would like to detect whether a page in a PDF has a header or footer, and extract the page numbers of the pages that do. Can this be scripted using PyPDF2? Please advise.

@abdullah07757 1 year ago

Fun to watch, hard to comprehend.

@ajsunofficial6798 2 years ago

For page extraction, say we want to extract page 2 and page 5. Do we use getobj.pageS(2,5)?
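
One way to pull out just pages 2 and 5 is to loop over the wanted page numbers and add each page to a writer. The sketch below assumes PyPDF2 3.x (`PdfReader`, `PdfWriter`, `add_page`); `to_indices`, `copy_pages`, and the file names are hypothetical and not taken from the video.

```python
def to_indices(page_numbers, page_count):
    """Convert 1-based page numbers to 0-based indices, rejecting out-of-range ones."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indices.append(number - 1)
    return indices


def copy_pages(src_path, dst_path, page_numbers):
    """Copy the given 1-based pages from src_path into a new PDF at dst_path."""
    # Assumes PyPDF2 3.x; imported here so to_indices works without it.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in to_indices(page_numbers, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as f:
        writer.write(f)


# Example (hypothetical file names):
#   copy_pages("book.pdf", "pages_2_and_5.pdf", [2, 5])
```

Page indices in the library start at 0, so "page 2" as a reader sees it is index 1; the helper makes that conversion explicit.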

@johnkhan174 2 years ago

Hello, I found a small bug in the code. If 'Waldo' exists in two places on the same page (in different sentences), the second 'Waldo' is not found. Can you provide a fix? Thanks!
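
A possible way around the issue described above is to split each page into sentences first and keep every sentence that contains the keyword, rather than stopping at the first hit. The sketch below is not the video's code: the function names are hypothetical and the sentence splitting is deliberately naive (it breaks after `.`, `!` or `?`, so abbreviations such as "Dr." will split early).

```python
import re


def sentences_with(text, keyword):
    """Return every sentence in `text` that contains `keyword` (case-insensitive)."""
    # Collapse line breaks so a sentence wrapped across lines is treated as one.
    flat = " ".join(text.split())
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return [s for s in sentences if keyword.lower() in s.lower()]


def keyword_sentences_by_page(page_texts, keyword):
    """Map 1-based page number to its matching sentences, skipping pages with none."""
    found = {}
    for number, text in enumerate(page_texts, start=1):
        matches = sentences_with(text, keyword)
        if matches:
            found[number] = matches
    return found
```

A sentence that mentions the keyword more than once is returned a single time, and two different sentences on the same page are both returned.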

@adamrassi3516 2 years ago

Hey, I'm new to this and using Thonny to edit and run code. When I get to extracting the text, a notepad file is opened but the text from the PDF is not written there. Any clue why this would happen?

@cbao97 8 months ago

Amazing tutorial. I noticed there is a \n at the end of each line. Is there anything we can do to detect the whole paragraph?
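
For the line-break question, one rough approach is to post-process the extracted string: treat blank lines as paragraph boundaries and join everything in between. This is only a sketch with a hypothetical helper name, and it assumes the extracted text actually contains blank lines between paragraphs, which depends on the PDF; when it does not, the whole page collapses into a single paragraph.

```python
import re


def unwrap_lines(text):
    """Join lines inside a paragraph; keep blank lines as paragraph breaks."""
    # Normalise Windows line endings first.
    text = text.replace("\r\n", "\n")
    # One or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every run of whitespace (including \n) to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

For example, `unwrap_lines("First line\nsecond line.\n\nNew paragraph\nhere.")` gives two paragraphs, each on a single line.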

@KyroAtelerix 2 years ago

And if I only want to extract some keywords across multiple pages in 100 PDFs, what might I do? I don't want all the text, only a few words.

@shahzan525 3 years ago

👍

@cars_worldcw488 2 years ago

Please, how can I avoid the line-break problem in some paragraphs of the result?

@yizzi25 2 years ago

Does the code work if there are multiple keywords in the same sentence?

@kingfunny4821 1 year ago

How can I extract only the highlighted text in a PDF?

@vrbaac1641 2 years ago

Hi, very nice video ^^... But will this procedure work if I need to extract certain text strings from PDFs generated from AutoCAD drawings? Thanks ^^

@ChartExplorers 2 years ago

Hi VR Baac, good question. I'm assuming the text in the AutoCAD PDFs is image-like, which makes things trickier. Can you highlight the text in the PDF? If so, you should be able to grab the text using the methods described in this video. If not, you will have to use something called pytesseract to get text from the PDF. Let me know if this is the case for you and I can try to make a video on how to do this.

@vrbaac1641 2 years ago

Hi @ChartExplorers, thank you for your reply. Actually I have tried some of the Python procedures such as PyPDF2, 4 and others... but with no success... the script runs but there is no output... Yes, the text in the PDF from AutoCAD drawings is selectable regardless of the orientation... so I am thinking there should be a way to get that text, but I am not sure how... We would greatly appreciate it if you could make a tutorial about this ^^... Thank you so much...

@ChartExplorers 2 years ago

I'll start working on a video. Would you feel comfortable sending me the PDF? If so, I'll send you my email and I'll practice on your PDF to make sure everything works for your situation. If not, that's perfectly fine; I have a few other examples I can use.

@vrbaac1641 2 years ago

@ChartExplorers oh!!!! Thank you so much... I think I can send you a portion of the drawing, specifically the title block area. The text in the PDF is selectable... I can send the file to you by email ^^...

@ChartExplorers 2 years ago

@VR Baac, awesome. This way I can make sure it works in your specific situation. My email is bradonvalgardson@gmail.com

@saurabhverma2155 2 years ago

Sir, can you guide me on how to extract text at specific coordinates from a PDF file?

@ChartExplorers 2 years ago

Hmmmm, good question. I'm assuming by coordinates you mean a specific location on a specific page (e.g., page 2, from 30 mm down and 20 mm right to 90 mm down and 50 mm right)? Or something to that effect?

@TheKylesauce 2 years ago

I can't get my code to find the PDF I'm trying to use. Does it need to be saved somewhere in particular?

@MindsetMinutess 2 years ago

You have to save it in the same directory as the .py file.
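
Saving the PDF next to the script is one fix. More generally, a relative file name is looked up in Python's current working directory, which some editors set to a different folder than the one holding the script, and a full path works from anywhere. The helper below is a hypothetical sketch for checking where Python is actually looking before handing the path to a PDF library.

```python
from pathlib import Path


def locate_pdf(path_text):
    """Return the absolute path of a file, or raise an error showing where Python looked."""
    path = Path(path_text).expanduser()
    if not path.is_absolute():
        # Relative paths are resolved against the current working directory,
        # which is not always the folder that holds the .py file.
        path = Path.cwd() / path
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path} (working directory: {Path.cwd()})")
    return path


# Example (hypothetical file name):
#   pdf_path = locate_pdf("wheres_waldo.pdf")
```

The error message prints both the path that was tried and the working directory, which usually makes the mismatch obvious.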

@Atharv-wm3vr 2 years ago

How did you install pypdf? When I typed it, it said not found.

@lasnroo 2 years ago

Hi buddy, is it possible to read line by line? If yes, how can I?

@JM-fr9bc 3 years ago

How about using a for loop to extract a text title to another for multiple pdfs?

@ChartExplorers 3 years ago

Hi Johan, good question. I'm not sure exactly how to do this. I'll see what I can figure out, but I won't be able to get to it for a few days. Can you send me another message in a few days' time and see what I have come up with? Thanks!

@JM-fr9bc 3 years ago

@ChartExplorers Thanks so much for your response. Just to restate, I'm trying to use a loop to extract a particular section, let's say one called "lessons", from a folder of PDFs and output it to an Excel file. Any help would be great! Thanks so much!

@JM-fr9bc 3 years ago

@ChartExplorers Hi! Were you able to come up with anything? :)

@KyroAtelerix 2 years ago

@ChartExplorers Did you solve it?
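
The thread above never received an answer. One possible shape for such a loop is sketched below, under several assumptions: the PDFs sit in a single folder, "extracting a section" is simplified to collecting the lines that contain a keyword, and the output is a CSV file (which Excel can open) rather than a native Excel workbook. All function names are hypothetical, and the default reader assumes PyPDF2 3.x.

```python
import csv
from pathlib import Path


def pdf_page_texts(pdf_path):
    """Default reader. Assumes PyPDF2 3.x is installed."""
    from PyPDF2 import PdfReader

    return [page.extract_text() or "" for page in PdfReader(str(pdf_path)).pages]


def lines_with(text, keyword):
    """Return the stripped lines of `text` that contain `keyword` (case-insensitive)."""
    return [line.strip() for line in text.splitlines() if keyword.lower() in line.lower()]


def scan_folder(folder, keyword, csv_path, read_pages=pdf_page_texts):
    """Search every PDF in `folder` and write (file, page, line) rows to a CSV file."""
    rows = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        for page_number, text in enumerate(read_pages(pdf), start=1):
            for line in lines_with(text, keyword):
                rows.append([pdf.name, page_number, line])
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "page", "line"])
        writer.writerows(rows)
    return rows


# Example (hypothetical folder and file names):
#   scan_folder("reports", "lessons", "lessons.csv")
```

The `read_pages` parameter exists so the loop can be tried with a stand-in function before pointing it at real PDFs.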

@dianamarahenry1020 2 years ago

I guess it doesn't matter if "the" is typed as "teh" as the presenter did!

@bryanl6300 1 year ago

New to Python coding. Sorry for the stupid questions. I ran the following in CMD:
pip install PyPDF2
Collecting PyPDF2
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
---------------------------------------- 232.6/232.6 kB 4.8 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
IDLE throws this error:
from PyPDF2 import PdfFileReader
ModuleNotFoundError: No module named 'PyPDF2'
What am I missing???
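
The pip output above shows PyPDF2 3.0.1 installing successfully, yet IDLE cannot import it. A common cause (an assumption here, since the comment does not say how Python was installed) is that `pip` on the command line belongs to a different Python installation than the one IDLE runs. The short check below, run from inside IDLE, prints which interpreter is in use and whether it can see the package; `is_importable` is a hypothetical helper name.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("Interpreter:", sys.executable)
print("PyPDF2 importable:", is_importable("PyPDF2"))

# If the second line prints False, install into this exact interpreter
# from a terminal, using the path printed on the first line:
#   "<path printed above>" -m pip install PyPDF2
```

Separately, PyPDF2 3.x deprecates the older `PdfFileReader` name in favour of `PdfReader`, so the import in the comment would likely need updating once the module itself is found.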