Damn, that was awesome! I can only imagine what goes through your head when you simply write lines of code as if they were normal sentences and it actually works! Well done, you've deserved your like and subscribe.
Wow, fantastic video! Very informative and shows how to parse PDFs in the real world! Well done. This will help me learn about Pythonic parsing from various data sources. Please make an equivalent for MS Word docx files. Thank you! 👍
Love it... I'm new to Python and learning because I want to do some data analysis on a dataset that can only be generated in a poorly laid out PDF (don't ask why!). So this is a great intro / guide to some fantastic functions, which gives me hope :-)
Thank you so much, man. Earlier I was working with PyPDF2, but it was not providing proper spacing while reading the file. With the help of pdfplumber I fixed this issue, and now the code is running completely fine. Thanks for helping with the name of the library.
Thank you dude! I finally found a real tutorial that helps me on my way to getting data out of a PDF that isn’t easy and well structured. I have another problem that you aren’t facing here. My invoice has items that are split across multiple pages, and I need to figure out how to assign the data from the second page to the invoice item on the page before!
That’s a fun challenge - similar to when you have multiple lines you want to assign to one. You have a variable that tracks where you are in the page / document, and use that to determine when to combine / assign info as needed.
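A minimal sketch of that tracking-variable idea, assuming a made-up layout where each item starts with a line like "1 Widget 10.00" and any other line belongs to the item above it. The regex, the field names, and the sample pages are hypothetical, not taken from the video:

```python
import re

# Hypothetical layout: an item starts with "<item no> <description> <amount>".
# Any other non-empty line is treated as a continuation of the open item.
ITEM_RE = re.compile(r"^(\d+)\s+(.+?)\s+([\d,]+\.\d{2})$")


def collect_items(pages):
    """pages: list of page texts, e.g. what page.extract_text() gives per page."""
    items = []
    current = None  # the open item; it is NOT reset at a page break
    for text in pages:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            match = ITEM_RE.match(line)
            if match:
                # A new item starts, so it becomes the one we attach lines to.
                current = {
                    "item": match.group(1),
                    "description": match.group(2),
                    "amount": match.group(3),
                    "notes": [],
                }
                items.append(current)
            elif current is not None:
                # Continuation line, possibly on the next page.
                current["notes"].append(line)
    return items


# Item 1 spills over from the first page onto the second.
pages = [
    "1 Widget 10.00\nblue, large",
    "ships in two boxes\n2 Gadget 1,250.50",
]
items = collect_items(pages)
print(items[0]["notes"])
```

On a real invoice you would also need to skip repeated page headers and footers before treating a line as a continuation.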
@PythonicAccountant yes, I’m hyped to test it tomorrow. I tried some other libraries too, but none of them worked for me. At first I wanted to do it with OCR, but I quickly found out that there are easier options when the PDF is searchable :D But I also face another problem where the PDF isn’t searchable, and these PDFs contain handwritten text + copied invoices. I worked with easyOCR at some point in a different project, but it’s not working for me right now.
This is super helpful. Thank you so much for sharing. One question that I have is on missing values. I see that your invoice happens to have no missing values for the columns. What if I have missing values in a few columns? Is it possible to also read those missing / blank values of the table while reading the text, to keep the columns consistent?
This is very helpful, thank you. I was wondering, is there a way to search for properties besides the text itself? For example, say every vendor number is always highlighted in blue, or it's always bold and a larger font size, or it is always directly above a thick horizontal line. Is there a way for pdfplumber to search for the text that matches those particular properties?
Bro... you saved me AF... I was trying to get data from a PDF and export that data to .csv or .xlsx and I was about to die trying to solve it :p. BTW I'm very new to scripting. Cheers from Argentina. You deserve millions of likes and subscribers! (now I'm one of them)
Awesome! Thanks for such an informative session. Can you please continue the NLP lectures on data extraction from different sources like JSON, web scraping, and PDF?
I do have several on web scraping, and more on PDFs, so check out my other videos! I don’t have too much on JSON but I do have a lot of experience working with that format, so I can certainly look at adding some videos on that!
Very informative thanks. I have a problem though in constructing a regular expression where the invoice may have a "Disc Amount" or not. At this time the expression assumes it is always blank.
Great tutorial! I agree, it’s the best tutorial I have found so far for working with unstructured PDF files. I have a question, although my docs are not invoices but forms. Some of these PDFs have graphics, i.e., blue circles/ellipses to “select” the best answer to some questions. Is there a way to identify these graphics and then select the text enclosed in them? If needed, I can share one of the PDFs so you can see what I mean. TIA
Amazing video! Thank you for posting! Just curious, is it possible to extract the vendor # and invoice # as well, even though some are empty in the PDF? I have a similar problem here with a Xero journal report. The debit and credit figures are in two separate columns, and I'm not sure how to use regex to tell which number is the debit and which is the credit.
Yep that’s a challenge but can usually be done. There are ways to use spacing to determine which is from which column but it gets a little tricky sometimes. If you want to send a sample page with any sensitive data redacted, I’d be happy to give it a shot - pythoniccpa@gmail.com
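One way to sketch the spacing idea is to look at where each number sits horizontally instead of at the text alone. The example below assumes word dictionaries with "text" and "x1" keys, which is the shape pdfplumber's page.extract_words() returns. The boundary value of 420 and the sample rows are made up for illustration; you would find the real boundary by inspecting a sample page:

```python
import re

# Matches amounts such as "850.25" or "1,200.00" (assumed format).
AMOUNT_RE = re.compile(r"-?[\d,]+\.\d{2}")


def split_debit_credit(words, boundary_x):
    """Return (debit, credit) for one row of words.

    words: dicts with 'text' and 'x1' (right edge), as in the output of
    pdfplumber's page.extract_words().
    boundary_x: assumed x position separating the two columns.
    """
    debit, credit = None, None
    for word in words:
        if not AMOUNT_RE.fullmatch(word["text"]):
            continue  # not an amount, e.g. the account name
        value = float(word["text"].replace(",", ""))
        # Amounts are usually right-aligned, so compare the right edge.
        if word["x1"] <= boundary_x:
            debit = value
        else:
            credit = value
    return debit, credit


# Hypothetical rows: one amount in the left column, one in the right column.
row_debit = [
    {"text": "Rent", "x0": 40.0, "x1": 62.0},
    {"text": "1,200.00", "x0": 352.0, "x1": 395.0},
]
row_credit = [
    {"text": "Sales", "x0": 40.0, "x1": 66.0},
    {"text": "850.25", "x0": 452.0, "x1": 490.0},
]
print(split_debit_credit(row_debit, boundary_x=420))
print(split_debit_credit(row_credit, boundary_x=420))
```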
Extremely helpful and easy. I'm looking to extract transaction details from bank or credit card statements covering, say, 3 years, and need to come up with spending patterns or projected spending in each category. The data in the statements sits in different boxes and views. Would like to see if you have any videos on this.
Can we do something different if we are not given a complete spreadsheet to extract data from? We generate such PDFs for each transaction, and we want a script that could take those generated PDFs directly from the Chrome browser and gather the name, invoice no., date, and the amount (amount + GST = total amount), then format them into a spreadsheet, and finally make a sheet that includes the total amount collected at the end of the day or whenever specified by the user. Hope it won't be as difficult as it looks. Eagerly waiting for your reply.
Absolutely fantastic video. Quick question I have, what changes would you make to the syntax if the PDF consists of let's say 350 pages or more with the same form of data across the pdf? That would be the biggest help. Thank you
Thanks for the question! The only change I’d make is to start a list and iterate through each page in the PDF, appending the resulting extract to the list. That would only hold one PDF page in memory at once, and the list wouldn’t get that big.
Thanks Steven! I learned pretty much everything I know about Python through a few Coursera courses, a few Python books, and some of the TalkPython trainings. Then it was just a bunch of playing around and having fun. All hands on, and trying to solve problems. Lots of googling also, that’s a programmer’s best friend no matter how experienced. I’ve included the code for most of these videos on GitHub at github.com/danshorstein/pythonic-accountant
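A rough sketch of that loop, assuming a hypothetical line layout of "<invoice no> <date> <amount>". The function takes page texts as an iterable so it can be tried without a PDF; the commented lines show how it could be fed from pdfplumber, with "invoices.pdf" as a placeholder file name:

```python
import re

# Hypothetical line layout: "<invoice no> <date> <amount>".
LINE_RE = re.compile(r"^(INV\d+)\s+(\d{2}/\d{2}/\d{4})\s+([\d,]+\.\d{2})$")


def extract_rows(page_texts):
    """Collect matching rows from every page into one list."""
    rows = []  # created once, OUTSIDE the page loop, so no page overwrites it
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            match = LINE_RE.match(line.strip())
            if match:
                rows.append((page_number, *match.groups()))
    return rows


# With pdfplumber, a generator keeps only one page's text around at a time:
#
#   import pdfplumber
#   with pdfplumber.open("invoices.pdf") as pdf:
#       rows = extract_rows(page.extract_text() or "" for page in pdf.pages)

sample_pages = [
    "Header\nINV001 01/02/2020 100.00",
    "INV002 01/03/2020 2,500.75\nFooter",
]
rows = extract_rows(sample_pages)
print(rows)
```

If only the last page's rows show up in your output, the usual cause is that the list (or the DataFrame) is being created inside the page loop rather than before it.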
Excellent series, so useful!!! If you can, do one where someone can upload a PDF of a balance sheet and it returns a data frame of all the entries (the column names could be "entry type (assets, liabilities...)", "entry name (ppl, inventory, retained earnings)", "value").
I have the same scenario as you with the vendor number and vendor name. How do I get it to print all vendors and not just the one on the last page? How do I use regex if my vendor number and name are separated by a hyphen? For example, 700 - Smith, Joe. Sorry, I’m a Python newb.
Of course, regex doesn’t care what profession it’s supporting :) The biggest factor is whether you are dealing with a scanned PDF or a computer generated PDF. If it is computer-generated then you need to make sure the text you’re trying to capture is not in an image. If the PDF is scanned, or you are trying to capture data from an image, then you will need to use a different approach than just regex, as you’ll need to perform OCR. One way to do that is with Tesseract, but there are other options as well.
How do you rate pdfplumber in comparison to pdfminer? Is it possible to draw black boxes on specific parts of the PDF using pdfplumber? From a short glance, it seems pdfplumber doesn't write the result into a PDF, unlike pdfminer, but it seems to keep the structure of the PDF, unlike pdfminer, which is very helpful. Any ideas?
For me, in the xlsx file, dates are being rendered as ######## with the message 'possible error loss' in Excel. Any idea why? (My dataframe looks completely fine.) How should I resolve this? Thanks!
Thank you! Perfectly explained, and your code is easy to follow. I have a question related to Hebrew/Arabic languages. When extracting the text, it's reversed. How can I fix it? Thank you!
Not sure if this is needed at all. Every invoice is actually generated by the data system, so it should be in Excel format from the very beginning. Why don't we use SQL to generate all the info?
Awesome tutorial. Really informative, and I learnt a lot. I have a question. Suppose I have a list of application forms (PDFs/images) that users fill in by hand, say "Name: _________ ". Here "Name:" is computer-printed text, but the blank is filled in with handwritten text. And there are multiple fields with the same issue that need to be extracted. How can I get the data extracted from these?
Best bet is using OCR, with something like tesseract. Here’s an example of one of many tutorials that can help with that www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/
But that OCR (Tesseract) is only for optical characters like digitally typed text, right? I don't think it can be used for handwritten text detection.
Hari Vamsi oh right, that’s different. Much more difficult and I haven’t done much with that, but think tensorflow or PyTorch could be your best bets for building a handwriting recognition model. Something like link.medium.com/QkxEiMcJN6
Thank you so much. I am working with a 100-page PDF file. I used your code, but only 5 rows of data from 1 page were exported to the df, even though all of the pages are extracted. Is there any solution?
Wow. Found this cool tutorial after struggling with a pricey proprietary app. Btw, how do I extract more than one line as a single value? For instance, there is an invoice with 3 lines in its description. Thank you.
Absolutely doable! Takes a few extra steps, particularly taking some steps to identify what row you are in and when you reach a last row that you need to combine all the data. I will keep this in mind for a future video
Fantastic video! It just solved my work problem easily! But I also tried to extract a description which is on a line after all the invoice line items. I am not sure how to get the description and append it as the last column in the DataFrame. I have found no text pattern for the description, so I can't use re.compile to fetch it. Could you give me a tip on that, please?
You can usually grab descriptions not based on the pattern of the description itself, but on the pattern of everything around it. For example, if there are three sets of numbers that always show up before the description, then use those three numbers as the pattern to determine that the description is coming next. If you want to send me a redacted example I could try and take a look - pythoniccpa@gmail.com
Teres Lok, you can have a flag for that. Maybe “in_line_items = False”. When your regex picks up that you’re in the line items, change it to “in_line_items = True”. Then on each iteration of a new line, if in_line_items is True AND the regex doesn’t match a line item, you know you’re now in the first line after the line items, which is your description line.
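The flag approach could look roughly like this. The line-item regex and the sample lines are hypothetical; swap in whatever pattern already matches your line items:

```python
import re

# Hypothetical line-item pattern: "<qty> <item text> <amount>".
LINE_ITEM_RE = re.compile(r"^\d+\s+\S.*\s+[\d,]+\.\d{2}$")


def find_description(lines):
    """Return the first non-empty line that follows the block of line items."""
    in_line_items = False
    for line in lines:
        line = line.strip()
        if LINE_ITEM_RE.match(line):
            in_line_items = True  # we are inside the line items now
        elif in_line_items and line:
            # First line after the line items that is not a line item itself.
            return line
    return None


lines = [
    "Invoice 1001",
    "1 Widget 10.00",
    "2 Gadget 5.50",
    "Monthly supplies for the Denver office",
    "Total 15.50",
]
print(find_description(lines))
```

If a document holds several invoices, you would reset the flag to False after grabbing each description instead of returning.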
That’s used for tuple unpacking when there are a varying number (0 or more) of values in that part of a list or tuple. See the “Asterisks in tuple unpacking” section of this post ( treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/#Asterisks_in_list_literals ) for more details
Hi, do you know how to read an entire PDF document using pdfplumber? I guess we are reading only one page at a time. Can't we read an entire PDF doc that contains 4 or more pages? I am unable to do this. Can anyone please share their thoughts?
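A small illustration of the starred name in unpacking. The sample line and the variable names are made up for this example:

```python
# The starred name soaks up however many values are left over (0 or more).
line = "700 Smith Joe Consulting"
vend_no, *vend_name_parts = line.split()
vend_name = " ".join(vend_name_parts)
print(vend_no, "|", vend_name)

# It also works in the middle, and yields an empty list when nothing is left.
first, *middle, last = ["a", "b"]
print(first, middle, last)
```

The starred name is always a list, even when it ends up empty, so joining it back into a string is safe.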
Yes. But pdfplumber is not very efficient (at least in my experience), so it takes a while for a large pdf document. There are other options for extracting text that may work better for larger files
Regular expressions are the Swiss Army knife on steroids. Are they used for web scraping in your experience? Is there another library you know of that is a jewel like this in the toolbox? I feel like I'm gulping from a river that is flowing with pure water.
This content is priceless indeed. Thank you for taking the time to create such a resourceful video, step by step with detailed narration. However, I have a PDF which does not contain tabular data. It contains text data which I need to convert into tabular .csv / Excel form. Can you help me with it, please?
@PythonicAccountant: I have sent you an email about this, so please have a look and suggest the modifications I need to make to solve my problem.