VBA Expert: Reading Scanned PDF's 

Kalkytron
33K views

Working example:
drive.google.com/file/d/1-vCv...
[updated on 19.01.2021]
Tools: Magick and Tesseract
Tesseract:
github.com/tesseract-ocr/tess...
Variables:
• VBA Beginner: 02. Vari...
Loops:
• VBA Beginner: 03. Loops
Conditionals
/
API's
/
Shell
• Topical: VBA & Shell F...
FSO library
/

Published: 1 Oct 2016

Comments: 32
@Lucian0623 7 years ago
Thank you for the video! Keep it up!
@bcapp7937 3 years ago
Thanks very much for sharing this. You mentioned that you have a 2.0 version where you put the Tesseract and Unar (apparently Magick now) zip files in the code. Could you share the code/file for this? Thanks!
@KhalilYasser 5 years ago
Thanks a lot. The working example link is not found. Can you update it, please?
@miguelangelsorianobueno5816 6 years ago
Hello, the working example link doesn't work anymore. Could you fix it, please? TY!
@jetzza1995 4 years ago
Hi Bart, can we please get the working example link? This would be very helpful for us.
@danielohlsson3649 7 years ago
Hi, thanks for a helpful video. I'm trying to read scanned documents (I have them as .tif files) that have a few check boxes, and I would like to know which of the check boxes have a check mark in them. Do you know if Tesseract supports this kind of task? I have read about OMR (Optical Mark Recognition), but I haven't found anything for a custom implementation in VBA or Python, which are the languages I know. Thank you for your help!
@kalkytron6385 7 years ago
Hi Daniel. I haven't used Tesseract with .tiff files. Also, Tesseract doesn't return anything for check boxes when read from a JPG file. But when I google "Tesseract tif", I immediately get some possibly useful hits. I suggest you first try to make it work at the command prompt based on those articles. Once that works, you can move it into the script.
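One way to follow the "command prompt first" advice is to let VBA print the exact command line, so it can be pasted into a command prompt before it is ever passed to Shell. This is a minimal sketch: the function names and example paths are hypothetical, and the only Tesseract usage assumed is the basic form `tesseract <image> <output base>`, where Tesseract itself appends `.txt` to the output base.

```vba
' Hypothetical helper: wraps a value in double quotes so paths with spaces survive.
Private Function Q(ByVal s As String) As String
    Q = Chr(34) & s & Chr(34)
End Function

' Builds: "tesseract.exe" "image" "outputBase"
' Tesseract adds the .txt extension to the output base on its own.
Public Function BuildTesseractCommand(ByVal sTesseract As String, _
                                      ByVal sImage As String, _
                                      ByVal sOutBase As String) As String
    BuildTesseractCommand = Q(sTesseract) & " " & Q(sImage) & " " & Q(sOutBase)
End Function

Public Sub PrintTesseractCommand()
    ' Example paths only; replace them with your own.
    Debug.Print BuildTesseractCommand("C:\OCR\Tesseract\tesseract.exe", _
                                      "C:\OCR\scan.tif", _
                                      "C:\OCR\scan")
End Sub
```

Running `PrintTesseractCommand` writes the command to the Immediate window (Ctrl+G in the VBE). If the pasted command works at the prompt, the same string can be handed to Shell unchanged.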
@daytodatainc.2520 6 years ago
Hello, the links are not working. Can you please update them?
@romanlight5525 6 years ago
Hello, can you update the example link?
@matewojno2103 1 year ago
I am trying to OCR scanned documents in pol, then extract only specific boxes and paste or import them into an Excel table. I have a solution in mind but not enough programming knowledge.
@chriscatterall4698 7 years ago
Thanks for the really useful video. I am trying to recreate this code using MS Excel 2010 on Windows 10. Is there a Windows equivalent for Unar and Tesseract, or is this solution only for Mac? Thanks again for your help.
@kalkytron6385 7 years ago
Hi Chris. The video was made on a Windows machine, so Unar and Tesseract definitely work there. Also, MS Excel 2010 is fine.
@chriscatterall4698 7 years ago
Hi Bart, thanks for your reply. I managed to find the Windows versions of Unar and Tesseract, and have downloaded your example spreadsheet. I had the same problem with the Shell command as Andre. I therefore substituted Andre's code (i.e. Function PDF_To_Txt down to ==================Put all the txt files into 1 & rename======================= .... End Function). I changed the file directory locations for my computer. However, I get a runtime error '53' File not found message, and the 'Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)' line of code immediately below ==============Convert PDF to TXT====================== is highlighted. I've pasted the code below. Am I missing something obvious? I'd be grateful if you could take a look. Thanks again. Chris

Function PDF_To_Txt(sPDF As String)
    'Tools -> Reference -> Microsoft Scripting Runtime
    'Process:
    ' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
    ' - Make sure the pictures are in the correct folder
    ' - JPG's to TXT's
    ' - All TXT's into 1 TXT
    ' - Collect everything in an Output folder
    'To do:
    ' - If picture; don't call Unar, but straight to Tesseract
    ' - Add .exe and dll's to Macro
    ' - Check whether files already exist before creating them
    Dim sPath As String
    Dim spdfEx As String
    Dim sTesseract As String
    Dim sUnar As String
    Dim sTXT As String
    Dim i As Integer
    Dim iSlashCounter As Integer
    Dim iPageCounter As Integer
    Dim sPDFname As String
    Dim sNewPath As String
    Dim FSO As Scripting.FileSystemObject
    Dim fsoFolder As Scripting.Folder
    Dim fsoFile As Scripting.File
    Dim fsoFile2 As Scripting.File
    Dim oTxt As Object
    Dim oTxtGet As Object
    Dim sAPI As String
    Set FSO = CreateObject("Scripting.FileSystemObject")
    iPageCounter = 1
    If UCase(Right(sPDF, 3)) <> "PDF" Then
        Exit Function
    End If
    'Get Path and PDF name from dir
    iSlashCounter = 0
    For i = Len(sPDF) To 1 Step -1
        If Mid(sPDF, i, 1) = "\" Then
            iSlashCounter = iSlashCounter + 1
            If iSlashCounter = 2 Then
                sPath = Mid(sPDF, 1, i)
                Exit For
            ElseIf iSlashCounter = 1 Then
                sPDFname = Mid(sPDF, i + 1)
                sPDFname = Mid(sPDFname, 1, Len(sPDFname) - 4)
            End If
        End If
    Next i
    If FSO.FolderExists(sPath & "Output") = False Then
        MkDir sPath & "Output\"
        Sleep 500
    End If
    sUnar = sPath & "unar.exe" 'PDF to JPG converter
    sTesseract = sPath & "Tesseract\tesseract.exe" 'JPG to TXT converter
    spdfEx = sPath & "pdfEx\pdfExtractor.exe"
    sTXT = sPath & sPDFname & ".txt"
    '==============Convert PDF to TXT======================
    Call Shell(spdfEx & " " & """" & sPDF & """ """ & sTXT & """", vbNormalFocus)
    sAPI = FindWindow(vbNullString, spdfEx)
    i = 0
    Do Until sAPI <> "0" Or i >= 50 'Catch the screen
        Sleep 50
        sAPI = FindWindow(vbNullString, spdfEx)
        i = i + 1
    Loop
    i = 0
    Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
        Sleep 500
        sAPI = FindWindow(vbNullString, spdfEx)
        i = i + 1
    Loop
    'Check whether there is something in the file
    Set oTxtGet = FSO.OpenTextFile(sTXT, ForReading)
    If FileLen(sTXT)
@kalkytron6385 7 years ago
Hi Chris, I noticed that the code I uploaded contains more than what I explained in the video. I first call an app that reads the PDF directly to TXT. It doesn't work for scans but is a lot faster, so I only go to Unar & Tesseract if it fails. In any case, I'm sorry for the confusion. Regarding the error in the shell, I created this video to explain the Shell function and what to pay attention to: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-YiHVMF5N9BY.html
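The "fast route first, OCR only if it fails" flow described here can be sketched as follows. This is an illustration under stated assumptions, not the uploaded workbook's code: `RunTextExtractor` and `RunOcr` are hypothetical placeholders for the two routes, and "failed" is assumed to mean that the text file is missing or empty, which is a simplification (an extractor may also write only whitespace for a scan).

```vba
' Placeholder (hypothetical): call the direct PDF-to-text tool here.
Private Sub RunTextExtractor(ByVal sPdf As String, ByVal sTxt As String)
End Sub

' Placeholder (hypothetical): call the image conversion and Tesseract steps here.
Private Sub RunOcr(ByVal sPdf As String, ByVal sTxt As String)
End Sub

' True when the text file is missing or empty, i.e. direct extraction produced nothing.
Public Function NeedsOcr(ByVal sTxt As String) As Boolean
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    NeedsOcr = True
    If fso.FileExists(sTxt) Then
        NeedsOcr = (fso.GetFile(sTxt).Size = 0)
    End If
End Function

Public Sub ConvertWithFallback(ByVal sPdf As String, ByVal sTxt As String)
    ' Step 1 (fast): works for PDFs that already contain text
    RunTextExtractor sPdf, sTxt
    ' Step 2 (slow): only reached for scans
    If NeedsOcr(sTxt) Then RunOcr sPdf, sTxt
End Sub
```

Checking `FileExists` before opening the file also avoids a run-time error when the first tool never produced any output.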
@ThePimentajoao 6 years ago
Hi there! I find this video very useful! Could you please help me by posting the code in a reply, or update the links? Thanks in advance for the help! João Pimenta
@BRVR_ 2 years ago
Hello, could you say how I can set a higher resolution for the output JPG files? I see that in the step from PDF to JPG, VBA produces JPGs that are very small and unrecognizable for Tesseract. When VBA tries to recognize the text from the JPG, the command line shows "Resolution 98" and the TXT mostly contains gibberish. I think I should put a command here like
magick convert input.pdf -resize 150% -quality 200 output.jpg
in this line:
Call Shell(sMagick & " """ & oFile.Path & """ """ & Left(oFile.Path, Len(oFile.Path) - 3) & "jpg" & """", vbNormalFocus) 'Run Magick: PDF to JPG
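For low-resolution JPGs like these, a commonly used ImageMagick option is `-density`, placed before the input PDF so that it sets the resolution at which the pages are rasterized. JPEG `-quality` runs from 1 to 100, so 200 is outside the valid range. Below is a sketch of how the Shell line might be adapted, assuming `sMagick` points at `magick.exe` as in the quoted line. The function name is hypothetical, and 300 DPI with quality 90 are illustrative starting points, not settings tested against this workbook.

```vba
' Builds: "magick.exe" -density <dpi> "in.pdf" -quality <q> "in.jpg"
Public Function BuildMagickCommand(ByVal sMagick As String, _
                                   ByVal sPdf As String, _
                                   ByVal lDpi As Long, _
                                   ByVal lQuality As Long) As String
    Dim sJpg As String
    ' Same output naming as the quoted line: swap the 3-letter extension for jpg
    sJpg = Left$(sPdf, Len(sPdf) - 3) & "jpg"
    BuildMagickCommand = Chr(34) & sMagick & Chr(34) & _
                         " -density " & CStr(lDpi) & _
                         " " & Chr(34) & sPdf & Chr(34) & _
                         " -quality " & CStr(lQuality) & _
                         " " & Chr(34) & sJpg & Chr(34)
End Function

Public Sub DemoMagickCommand()
    ' Print the command first; once it works at the command prompt,
    ' pass the same string to Shell. Example paths only.
    Debug.Print BuildMagickCommand("C:\OCR\magick.exe", "C:\OCR\scan.pdf", 300, 90)
    ' Call Shell(BuildMagickCommand(sMagick, oFile.Path, 300, 90), vbNormalFocus)
End Sub
```

If no JPG is produced at all, it may be worth checking that Ghostscript is installed, since ImageMagick normally relies on it to read PDFs.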
@daviddarby3738 6 years ago
Complex code and no demo?
@drag1c 3 years ago
Hi Kalkytron! I am using Professional 2013, magick 7.0.11 and tess 5. Is there a chance that your Excel file does not work with them? Of course, I changed the source parts of your Excel file. Also, to point out, I do not have the Microsoft Excel 16.0 Object Library; I have the Microsoft Excel 15.0 Object Library. The same goes for the Microsoft Excel 16.0 Office Library. The problem is that it loads indefinitely while Excel takes 0 MB of RAM. It runs into a bug.
@drag1c 3 years ago
I've found out that the Shell function does not work properly for the part that converts from PDF to JPG. Simply put, when I run it, I don't get a JPG file. I've tried Part 4 of the program on a JPG file (I manually added a JPG file with text to the folder) and it works. Do you know how to fix the Shell part for PDF to JPG?
@deandog7223 3 years ago
Hi @Kalkytron, is it possible that you still have the working source code?
@kalkytron6385 3 years ago
Hi DeanDog. Try this link: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing It's an updated version.
@D3_Business_Analytics 6 years ago
The link is not working, dear.
@andrejackson6020 7 years ago
Hi Bart, thanks for the video. I'm having some issues with the Shell command. Please help!
@kalkytron6385 7 years ago
Hi Andre. Can you paste the code that you are using here? And maybe the error you are getting as well.
@andrejackson6020 7 years ago
I found that the code in the original downloadable version was different from the code at the end of the video; maybe I'm wrong. I'm only interested in the Unar and Tesseract parts. When I execute the code (F8), it seems that it's skipping the Unar part and is trying to convert the PDF directly to a text file rather than going through the whole process. Thanks in advance.

Function PDF_To_Txt(sPDF As String)
    'Tools -> Reference -> Microsoft Scripting Runtime
    'Process:
    ' - PDF to JPG/TIFF with Unar --> output is 1 picture per PDF page
    ' - Make sure the pictures are in the correct folder
    ' - JPG's to TXT's
    ' - All TXT's into 1 TXT
    ' - Collect everything in an Output folder
    'To do:
    ' - If picture; don't call Unar, but straight to Tesseract
    ' - Add .exe and dll's to Macro
    ' - Check whether files already exist before creating them
    Dim sPath As String
    sPath = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs\"
    Dim i As Integer
    Dim iSlashCounter As Integer
    Dim iPageCounter As Integer
    Dim sPDFname As String
    Dim fsoFolder As Scripting.Folder
    Dim fsoFile As Scripting.File
    Dim fsoFile2 As Scripting.File
    Dim oTxt As Object
    Dim oTxtGet As Object
    Dim sAPI As Long
    Dim FSO As Scripting.FileSystemObject
    Dim oFolder As Scripting.Folder
    Dim oFile As Scripting.File
    Dim sFolder As String
    Dim sUnar As String
    Dim sTesseract As String
    Dim sTxt As String
    Dim sNewPath As String
    Set FSO = CreateObject("Scripting.FileSystemObject")
    sFolder = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract\ReadPDFs"
    Set oFolder = FSO.GetFolder(sFolder)
    sUnar = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract/" & "unar.exe"
    sTesseract = "C:\Users\Andre\Desktop\Tesseract-OCR\Tesseract" & "/Tesseract.exe"
    iPageCounter = 1
    For Each oFile In oFolder.Files
        sTxt = sFolder & oFile.Name & "convert"
        sNewPath = sFolder & "\" & oFile.Name & "pdf"
        '==============Convert PDF to TXT======================
        Call Shell(sUnar & " " & """" & oFile.Name & """", vbNormalFocus) 'Run Unar: PDF to JPG
        sAPI = FindWindow(vbNullString, sUnar)
        i = 0
        Do Until sAPI <> "0" Or i >= 50 'Catch the screen
            Sleep 50
            sAPI = FindWindow(vbNullString, sUnar)
            i = i + 1
        Loop
        i = 0
        Do Until sAPI = "0" Or i >= 50 'loop until the screen is away
            Sleep 500
            sAPI = FindWindow(vbNullString, sUnar)
            i = i + 1
        Loop
        'Check whether a folder is made by unar. If not; make one and copy the jpg to it
        If FSO.FolderExists(sNewPath) = False Then
            MkDir sNewPath
            Do Until FSO.FolderExists(sNewPath) = True
                Sleep 100
            Loop
            Dim SourceFile, DestinationFile As String
            SourceFile = sPath & oFile.Name 'Define source file name.
            DestinationFile = sNewPath & oFile.Name 'Define target file name.
            'Copy the jpg to the newly created folder
            Set oFolder = FSO.GetFolder(sPath)
            If Mid(oFile.Name, 1, 4) = "Page" Then
                FileCopy SourceFile, DestinationFile
                Sleep 500
                Kill sPath & oFile.Name
                Sleep 500
                Exit For
            End If
        End If
        Call Shell(Chr(34) & sTesseract & Chr(34) & " " & Chr(34) & sNewPath & oFile.Name & Chr(34) & " " & Chr(34) & sTxt & Chr(34), vbNormalFocus)
        sAPI = FindWindow(vbNullString, sTesseract)
        i = 0
        Do Until sAPI <> "0" Or i >= 50 'Catch the screen
            Sleep 50
            sAPI = FindWindow(vbNullString, sTesseract)
            i = i + 1
        Loop
        i = 0
        Do Until sAPI = "0" Or i >= 50 'Loop until the screen is away
            Sleep 500
            sAPI = FindWindow(vbNullString, sTesseract)
            i = i + 1
        Loop
    Next oFile
    '==================Put all the txt files into 1 & rename=======================
    Set oFolder = FSO.GetFolder(sNewPath)
    sTxt = sNewPath & sPDFname & ".txt"
    Set oTxt = FSO.CreateTextFile(sTxt, True) 'Create new txt file
    Do Until FSO.FileExists(sTxt) = True
        Sleep 100
    Loop
    For Each oTxtGet In oFolder.Files 'Loop through other files and copy text
        If CStr(oTxtGet) <> sTxt Then
            Set oTxtGet = FSO.OpenTextFile(oTxtGet, ForReading)
            Sleep 1500
            On Error Resume Next
            oTxt.WriteLine (oTxtGet.ReadAll)
            On Error GoTo 0
            oTxt.WriteLine (" _-_-_-_-_-_-_-_-_- " & "Page " & iPageCounter & " _-_-_-_-_-_-_-_-_- ")
            oTxtGet.Close
            iPageCounter = iPageCounter + 1
        End If
    Next oTxtGet
    oTxt.Close
    'Copy txt file to an output folder
    On Error Resume Next
    MkDir sPath & "Output\"
    On Error GoTo 0
    FileCopy sTxt, sPath & "Output\" & sPDFname & ".txt"
    On Error Resume Next
    Kill sNewPath & "*" 'delete all files in folder
    Do Until Err.Number = 0
        On Error GoTo 0
        On Error Resume Next
        Kill sNewPath & "*" 'delete all files in folder
        Sleep 200
    Loop
    On Error GoTo 0
End Function
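Some of the paths in the code above appear to be assembled without a separator (for example `sFolder & oFile.Name`, where `sFolder` has no trailing backslash) or with a forward slash. One way to sidestep this is to use the path helpers of `Scripting.FileSystemObject`. A small sketch follows: the function names and example paths are hypothetical, and late binding is used so that no reference has to be set.

```vba
' Joins a folder and a file name with exactly one separator,
' whether or not the folder already ends in a backslash.
Public Function JoinPath(ByVal sFolder As String, ByVal sName As String) As String
    JoinPath = CreateObject("Scripting.FileSystemObject").BuildPath(sFolder, sName)
End Function

' Output base next to the source file, e.g. C:\Scans\doc.pdf -> C:\Scans\doc
Public Function OutputBaseFor(ByVal sFile As String) As String
    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    OutputBaseFor = fso.BuildPath(fso.GetParentFolderName(sFile), _
                                  fso.GetBaseName(sFile))
End Function

Public Sub DemoPaths()
    ' Example paths only
    Debug.Print JoinPath("C:\Scans", "unar.exe")
    Debug.Print OutputBaseFor("C:\Scans\doc.pdf")
End Sub
```

Passing `oFile.Path` rather than `oFile.Name` to the external tools may help as well, because Shell does not necessarily run with the PDF folder as its working directory.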
@kalkytron6385 7 years ago
Hi Andre, I don't recommend walking through the code with F8, because the code is trying to find windows and handle them. You might interrupt this by going back to the VBE at every step. Also, the code you show starts by running Unar, so that seems OK. I do have a version in which I first try to read the PDF with a different console app (this works for, e.g., Word files saved as PDF). If this fails, I go for Unar and Tesseract. In any case, I don't think that I have this in the uploaded version on Dropbox.
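Because the posted code polls for console windows with FindWindow, stepping through it changes the timing. An alternative that does not depend on window handling is to let Windows Script Host wait for the process: the `Run` method of `WScript.Shell` accepts a wait flag and returns the exit code. This is a swapped-in technique rather than what the posted workbook does, and the function name is hypothetical.

```vba
' Runs a command line in a hidden window, blocks until it finishes,
' and returns the exit code of the process.
Public Function RunAndWait(ByVal sCommand As String) As Long
    Dim oShell As Object
    Set oShell = CreateObject("WScript.Shell")
    ' 0 = hidden window, True = wait for the process to end
    RunAndWait = oShell.Run(sCommand, 0, True)
End Function

Public Sub DemoRunAndWait()
    Dim lExit As Long
    ' "cmd /c exit 3" does nothing except end with exit code 3
    lExit = RunAndWait("cmd /c exit 3")
    Debug.Print "Exit code: " & lExit
End Sub
```

With this approach, the Sleep and FindWindow loops after each Shell call would no longer be needed, and a non-zero exit code gives a first hint that the external tool failed.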
@jeronimo6159 1 year ago
Hi, do you do paid work? If so, how do I contact you?
@Eric-pi2rn 5 years ago
Hi, does anyone have a working version of this? I tried manually copying everything, but had no luck. This would be exactly what I need.
@kalkytron6385 3 years ago
Hi Eric. Here is an updated version: drive.google.com/file/d/1-vCvJRWg6m6k1d23_fRC2NyQ0q4vsAdP/view?usp=sharing