Extract and Visualize Data from PDF Tables with PDFplumber in Python

Howdy all! I recently published a story that was based on some data analysis I did of a report I obtained from the Department of Behavioral Health and Developmental Services in VA. I wanted to share a quick walkthrough of how I extracted the data from tables in a PDF using a Python module called PDFplumber. Here's a link to the text version with the code - github.com/gam32bit/tdo
By using PDFplumber, I was able to create a graph which shows the trend at the center of my article. I hope some of you can take something away from this walkthrough that will help you supplement your own reporting, especially if you're interested in data journalism.
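For readers following along in text, here is a minimal sketch of the kind of workflow described above, not the exact code from the video (that lives in the linked repository). It assumes the table sits on a single page, that its first row is a header, and that the third-party packages pdfplumber and matplotlib are installed. The helper names (extract_rows, rows_to_records, to_number, plot_trend) and any column names you pass in are made up for illustration.

```python
from typing import Optional


def extract_rows(pdf_path: str, page_index: int = 0) -> list:
    """Return the first table found on one page as a list of rows."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        table = page.extract_table()  # None when no table is detected
    return table or []


def rows_to_records(rows: list) -> list:
    """Assume the first row is the header; turn the other rows into dicts."""
    if not rows:
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


def to_number(text: str) -> Optional[float]:
    """Parse '1,234' style strings; return None when the cell is not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def plot_trend(records: list, x_key: str, y_key: str, out_path: str) -> None:
    """Save a simple line chart of one column against another."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    # Keep only rows whose y value parses as a number.
    points = [(r[x_key], to_number(r[y_key])) for r in records]
    points = [(x, y) for x, y in points if y is not None]
    fig, ax = plt.subplots()
    ax.plot([x for x, _ in points], [y for _, y in points], marker="o")
    ax.set_xlabel(x_key)
    ax.set_ylabel(y_key)
    fig.savefig(out_path)
    plt.close(fig)
```

A call might look like plot_trend(rows_to_records(extract_rows("report.pdf")), "Year", "Count", "trend.png"), where the file name and the column names are placeholders for whatever your own report contains.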
I'm by no means an expert coder, very much a beginner, so if there are things I could have done better let me know. That being said, I hope this walkthrough proves that any journalist can use programming to enhance their work, so you should try it if you haven't already!
PDFplumber docs - github.com/jsvine/pdfplumber
Python tutorials - @socratica
jwcaterine.com
#python #walkthrough #journalism

Published: 25 Jun 2023

Comments: 17

@ramarisonandry8571 11 months ago

I'm watching your video from Madagascar. Great job, thank you!

@JWCat757 11 months ago

Wow! Very cool. Thanks for watching!

@virajmoghe2012 9 months ago

This is amazing stuff. God bless you. Keep up the good work

@JWCat757 8 months ago

Thank you!

@YashsCodeCamp 4 months ago

Thanks!

@user-lv9cv1cp8x 5 months ago

Thanks a lottttt !!!!!!!!!!!!!!!!!

@JWCat757 5 months ago

You’re welllllcommeeeeee!!!!!!!!!!

@gvenagas 2 months ago

I found that if you open a PDF file in Mozilla Firefox and inspect it with the developer tools, you can collect its text (with the help of JavaScript) after the browser has converted it to HTML, and maybe save it for further processing with some programming language.

@JWCat757 15 days ago

Interesting!

@cken27 8 months ago

If you are interested in PDF table extraction, give the "camelot" library a try. I found it superior to PDFplumber in terms of automatic table identification. It could detect bank statement tables without explicit lines and with empty cells. Also, the resulting object is already a pandas DataFrame, so you can select and clean the data in the usual pandas way.

@JWCat757 8 months ago

Thank you for sharing, I will definitely give it a try!

@ajarivas72 6 months ago

@JWCat757 Do both libraries work on tables built as images or vectorized images (selectable)?

@JWCat757 6 months ago

@ajarivas72 PDFplumber works with images, but it takes work to get it to read the table. See the "Visual Debugging" section of the ReadMe for more info - github.com/jsvine/pdfplumber#visual-debugging As for camelot, I'm not as familiar with it, but from what I can tell it doesn't seem to work with images.
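For anyone curious what the camelot suggestion in this thread looks like in practice, here is a rough sketch. It assumes camelot is installed (the package is published as camelot-py) and uses its "stream" flavor, which is the mode meant for tables without ruling lines. Each table camelot returns exposes a pandas DataFrame through its .df attribute. The function names pages_arg and read_unruled_tables are invented for this example.

```python
def pages_arg(pages) -> str:
    """Build the comma-separated page string camelot takes, e.g. [1, 3] -> '1,3'."""
    return ",".join(str(p) for p in pages)


def read_unruled_tables(pdf_path: str, pages=(1,)) -> list:
    """Return one pandas DataFrame per table detected on the given pages."""
    import camelot  # third-party: pip install camelot-py

    # flavor="stream" guesses columns from whitespace instead of drawn lines;
    # flavor="lattice" (camelot's default) expects ruled tables.
    tables = camelot.read_pdf(pdf_path, pages=pages_arg(pages), flavor="stream")
    return [tables[i].df for i in range(len(tables))]
```

From there the DataFrames can be filtered and cleaned with ordinary pandas operations, as the original comment describes.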

@bxroberts 1 year ago

Great video! Do you know if the extract tables functionality needs the tables to be ruled?

@JWCat757 1 year ago

Thank you! According to the PDFplumber docs, it will find both lines that are explicitly defined and/or implied by the alignment of words on the page, so my guess is that tables don't need to be ruled.
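As a small illustration of the point in this thread: pdfplumber's table finder accepts a table_settings dictionary, and setting both strategies to "text" asks it to infer cell boundaries from word alignment rather than from drawn lines. The sketch below is one way to try that on an unruled table. The function names are hypothetical, and the tolerance value is simply the library's documented default, so expect to tune the settings for a real document.

```python
def text_aligned_settings(snap_tolerance: int = 3) -> dict:
    """Settings that tell pdfplumber to infer table lines from word positions."""
    return {
        "vertical_strategy": "text",    # column edges implied by aligned words
        "horizontal_strategy": "text",  # row edges implied by aligned words
        "snap_tolerance": snap_tolerance,
    }


def extract_unruled_table(pdf_path: str, page_index: int = 0) -> list:
    """Try to pull one table from a page that has no ruling lines."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        table = page.extract_table(table_settings=text_aligned_settings())
    return table or []  # extract_table returns None when nothing is found
```

The "Visual Debugging" tools in the pdfplumber ReadMe are a handy way to check which lines the finder actually inferred.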

@bennguyen1313 5 months ago

Not sure how to choose from the many Python packages for extracting data from a PDF: PyMuPDF, PyPDF2, PDFplumber, tabula-py, etc. For example, what if the PDF is a scan of a paper document, i.e. it's crooked and the quality is bad? Is there one that does it best? Or maybe I should use AI (ChatGPT + GPT4Vision/Ai PDF) to do OCR, then have it extract the data?

Also, any suggestions on how to get the values from specific columns in a text file? For example, I have a text file with data like this:

    #Time (HHH:MM:SS): 002:34:02
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    #Time (HHH:MM:SS): 002:34:03
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000

How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?

@JWCat757 4 months ago

I don't have an exact answer to your question, but I will say that when I posted a specific problem like this to the discussions section of the PDFPlumber github I got a pretty quick and thorough response - github.com/jsvine/pdfplumber/discussions
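One possible approach to the text-file part of the question above, offered as a sketch rather than a definitive answer: skip every line that starts with "#" (which covers the repeating #Time, header, and separator lines) and keep the last eight whitespace-separated tokens of each remaining row. This rests on an assumption that is not confirmed by the thread: that every data row ends with exactly eight data words, DT00 through DT07. If some rows carry fewer, slicing by fixed column positions from the original file would be safer. The function names are made up for illustration.

```python
def extract_data_words(lines, n_words: int = 8) -> list:
    """Collect the trailing n_words tokens from every non-comment row."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # drops the #Time, column-header, and ===== lines
        tokens = stripped.split()
        if len(tokens) < n_words:
            continue  # too short to contain all the data words
        rows.append(tokens[-n_words:])  # assumed to be DT00..DT07
    return rows


def hex_rows_to_ints(rows) -> list:
    """Interpret each token as a hexadecimal word, e.g. 'CCCD' -> 52429."""
    return [[int(token, 16) for token in row] for row in rows]
```

With a file on disk, something like extract_data_words(open("capture.txt")) would do the same thing line by line; the file name here is a placeholder.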