PDF Amazon Textract with Python 

Cloud Quick Labs
Subscribers: 15K
Views: 10K

This video shows how to extract raw data from a PDF file using Amazon Textract with Python (the boto3 module). It covers the flow, a line-by-line walkthrough of the code, and a demo of running that code to extract the data.
The code used in the video can be found in this Git repo: github.com/RekhuGopal/PythonH...
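
As a rough illustration of the flow described above (not necessarily the code from the video), the sketch below starts an asynchronous text-detection job for a PDF in S3, polls until the job finishes, and collects the text of the LINE blocks. The helper names, bucket name, and file name are placeholders chosen for this example.

```python
import time


def start_job(client, bucket, document):
    """Start an asynchronous text-detection job for a PDF stored in S3."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document}}
    )
    return response["JobId"]


def wait_for_job(client, job_id, delay=5):
    """Poll until the job leaves the IN_PROGRESS state; return the final status."""
    while True:
        status = client.get_document_text_detection(JobId=job_id)["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(delay)


def get_lines(client, job_id):
    """Collect the text of every LINE block, following NextToken pagination."""
    lines, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_text_detection(**kwargs)
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])
        token = page.get("NextToken")
        if not token:
            return lines


# Usage (needs AWS credentials; the names below are placeholders):
#   import boto3
#   textract = boto3.client("textract")
#   job_id = start_job(textract, "my-bucket", "my-file.pdf")
#   if wait_for_job(textract, job_id) == "SUCCEEDED":
#       print("\n".join(get_lines(textract, job_id)))
```

The functions take the client as an argument so they can be tried with a stub before pointing them at a real account.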

Published: Aug 7, 2021

Comments: 34
@ced4298 2 years ago
Great explanation of the code and details of the process!
@cloudquicklabs 2 years ago
Thank you for watching my videos. Keep watching, and happy learning.
@rupadevibaskar4652 2 years ago
I really appreciate your work. Helpful one! How can I save the output as a text file, and how do I call the APIs to get the keys.csv file? Thanks.
@cloudquicklabs 2 years ago
Thanks for watching my videos. FYI:
    ## Read the keys (column names) of a CSV file
    import csv
    with open('test.csv') as f:
        reader = csv.DictReader(f)
        print(reader.fieldnames)
    ## Write lines to a text file
    lines = ['Readme', 'How to write text files in Python']
    with open('readme.txt', 'w') as f:
        for line in lines:
            f.write(line)
            f.write('\n')
@maniraj3237 1 year ago
Great explanation. What if I want to limit the number of pages for AWS Textract? Where exactly would I need to specify that, as I believe AWS charges based on the number of pages. Thanks.
@cloudquicklabs 1 year ago
Thank you for watching my videos. The 'start_document_text_detection' API has no parameter for choosing which pages to extract, so your requirement would have to be handled in code on the output. Charging would be according to the number of API invocations; you can check the pricing documentation for details.
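
A minimal sketch of handling this on the output side, assuming the blocks returned by the asynchronous API carry a `Page` number: keep only the LINE blocks from the pages you care about. The function name and the sample data are made up for illustration. Note that this only trims the results; the job itself still processes the whole document.

```python
def lines_on_pages(blocks, pages):
    """Return the text of LINE blocks whose Page number is in `pages`."""
    wanted = set(pages)
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE" and block.get("Page") in wanted
    ]


# Hypothetical sample shaped like the "Blocks" list of a Textract response.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Invoice 001"},
    {"BlockType": "LINE", "Page": 2, "Text": "Terms and conditions"},
]

print(lines_on_pages(sample_blocks, [1]))  # ['Invoice 001']
```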
@pearljennifer9996 1 year ago
Great video! Can this technique be used for capturing data from invoices and bills as well?
@cloudquicklabs 1 year ago
Thank you for watching my videos. Indeed, this technique can be used in applications that extract data from invoices and bills.
@Felix9204 11 months ago
Thanks for the tutorial. It has really helped me with what I need. I was wondering: on the last line, line 61, you print out the extracted lines of text. Do you know how I can get each line of text, save it as either a JSON or a text file, and upload the resulting file to another S3 bucket?
@cloudquicklabs 11 months ago
Thank you for watching my videos. Did you check my video here? ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-kzOBNLzpRLE.html
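
For readers who want a starting point without the linked video, here is one possible sketch: serialize the extracted lines to JSON and upload the result with the S3 `put_object` call. The function name, bucket, and key are placeholders for this example, and this is not taken from the video's code.

```python
import json


def upload_lines_as_json(s3_client, lines, bucket, key):
    """Serialize the extracted lines to JSON and upload them to S3."""
    body = json.dumps({"lines": lines}, ensure_ascii=False, indent=2)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    return body


# Usage (needs AWS credentials; bucket and key are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_lines_as_json(s3, ["first line", "second line"],
#                        "my-output-bucket", "results/my-file.json")
```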
@NoWhiteGullibility 2 years ago
Please add parsing of the confidence values into the table output
@cloudquicklabs 2 years ago
Sure, I will create a new video on it. Thank you for watching my videos.
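
Until then, a small sketch of reading confidence values, assuming each LINE block in the response carries a `Confidence` score between 0 and 100. The function name, threshold, and sample data are illustrative only.

```python
def lines_with_confidence(blocks, min_confidence=0.0):
    """Return (text, confidence) pairs for LINE blocks at or above a threshold."""
    rows = []
    for block in blocks:
        if block["BlockType"] == "LINE" and block["Confidence"] >= min_confidence:
            rows.append((block["Text"], round(block["Confidence"], 2)))
    return rows


# Hypothetical sample shaped like the "Blocks" list of a Textract response.
sample_blocks = [
    {"BlockType": "LINE", "Text": "Total: 42.00", "Confidence": 99.5},
    {"BlockType": "LINE", "Text": "smudged text", "Confidence": 42.1},
    {"BlockType": "WORD", "Text": "Total:", "Confidence": 99.7},
]

for text, confidence in lines_with_confidence(sample_blocks, min_confidence=90):
    print(f"{confidence:6.2f}  {text}")
```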
@RalphDratman 1 year ago
Thank you!
@cloudquicklabs 1 year ago
Thank you for watching my videos. Glad that it helped you.
@ahmedsaif4541 1 year ago
Hi Sir, thank you for the great explanation. What if I want to pass several files in documentName as a parameter for detection, rather than passing a single file?
@cloudquicklabs 1 year ago
Thank you for watching my video. Please go through another video of mine, ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-kzOBNLzpRLE.html, which suits your requirements.
@cloudquicklabs 1 year ago
If this is not enough, I could suggest another method as well.
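
One simple way to handle several files, sketched under the assumption that all of them sit in the same bucket: loop over the names and start one job per document, keeping the job IDs for later polling. The function name and file names are placeholders.

```python
def start_jobs(client, bucket, document_names):
    """Start one text-detection job per document; return {name: job_id}."""
    jobs = {}
    for name in document_names:
        response = client.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}}
        )
        jobs[name] = response["JobId"]
    return jobs


# Usage (needs AWS credentials; the names below are placeholders):
#   import boto3
#   textract = boto3.client("textract")
#   jobs = start_jobs(textract, "my-bucket", ["a.pdf", "b.pdf"])
```

Because the jobs are asynchronous, they run on the service side in parallel; each job ID can then be polled and read separately.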
@abdelazizkamomegna471 2 years ago
Very well explained, thank you. I would like to know how to get the output in JSON format. Thanks.
@cloudquicklabs 2 years ago
Thank you for watching my videos. Apologies for the delayed response. The API response is actually already JSON; we are only taking the actual values out of it.
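
In Python, boto3 hands that response back as a dictionary, so keeping the full JSON is mostly a matter of dumping it to a file. A minimal sketch, with a made-up function name and sample response:

```python
import json


def save_response(response, path):
    """Write a Textract response dictionary to `path` as formatted JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # default=str is a safety net for any value json cannot serialize.
        json.dump(response, f, indent=2, default=str)


# Hypothetical, heavily trimmed response used only to demonstrate the call.
sample_response = {
    "JobStatus": "SUCCEEDED",
    "Blocks": [{"BlockType": "LINE", "Text": "Hello", "Confidence": 99.1}],
}

save_response(sample_response, "textract_output.json")
```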
@MrCrismath 2 years ago
Excellent video! How do I automate the process so that it reads multiple PDF files?
@cloudquicklabs 2 years ago
Thank you for watching my videos. When there are multiple .pdf files, we need to iterate over them in code and extract the text from each. I shall create a new video on it. One question: are the file names in the S3 bucket the same, or is it a list of different names?
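
If the file names are not known in advance, one option is to list the bucket and keep the keys that end in `.pdf`. The sketch below uses the S3 `list_objects_v2` paginator; the function name, bucket, and prefix are placeholders for this example.

```python
def list_pdf_keys(s3_client, bucket, prefix=""):
    """Return the keys of all objects under `prefix` that end in .pdf."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                keys.append(obj["Key"])
    return keys


# Usage (needs AWS credentials; the names below are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   for key in list_pdf_keys(s3, "my-bucket", prefix="invoices/"):
#       print(key)
```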
@MrCrismath 2 years ago
@cloudquicklabs Thank you! Another question: is there a way to reduce the processing time? It extracts the text of a single-page PDF and takes approximately 7 minutes!
@cloudquicklabs 2 years ago
Thank you again for coming back on this. Processing speed can be increased by batching the jobs. I would encourage you to look at another video, which helps with batching: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-uf8heaG0IhU.html
@shashikumarega3401 2 years ago
Hi, can you please tell me how to convert doc or docx to PDF in AWS Lambda with Python?
@cloudquicklabs 2 years ago
Thank you for watching my videos. I am not sure whether you have tried the option below.
    # Python 3 program to convert docx to pdf using the docx2pdf module
    from docx2pdf import convert
    # Convert a docx located in the same folder as the Python file
    convert("GFG.docx")
    # Convert a docx, specifying both the input and output paths
    # (the output filename need not be the same as the docx)
    convert("GeeksForGeeks/GFG_1.docx", "Other_Folder/Mine.pdf")
    # Bulk conversion of a folder
    convert("GeeksForGeeks/")
@shashikumarega3401 2 years ago
@cloudquicklabs Thanks. On a Linux system it is not working.
@CarlosAugusto-ng7qr 3 years ago
How do I show the result if my PDF is in table format?
@cloudquicklabs 3 years ago
Thank you for watching my videos. As of now, the 'start_document_text_detection' API doesn't support "form['Table']", but you can try other APIs that do support it, for example as explained in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--SpHPW3RTx8.html
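
As one example of such an API, Textract's document analysis operations (for instance `start_document_analysis` called with `FeatureTypes=["TABLES"]`) return TABLE and CELL blocks. The sketch below is an illustration rather than the code from the linked video: it rebuilds the first table from a list of blocks by following the CHILD relationships. The function name and sample data are made up.

```python
def table_rows(blocks):
    """Rebuild the first TABLE in `blocks` as a list of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def children(block):
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return [by_id[i] for i in ids if i in by_id]

    table = next((b for b in blocks if b["BlockType"] == "TABLE"), None)
    if table is None:
        return []

    cells = {}
    for cell in children(table):
        if cell["BlockType"] != "CELL":
            continue
        words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are 1-based in the response.
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    n_rows = max((r for r, _ in cells), default=0)
    n_cols = max((c for _, c in cells), default=0)
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


# Hypothetical, trimmed sample shaped like an analysis response's "Blocks".
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
    {"Id": "w4", "BlockType": "WORD", "Text": "pen"},
]

for row in table_rows(sample_blocks):
    print(row)
```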
@qzwwzt 9 months ago
I'm getting this error: ClientError: An error occurred (UnrecognizedClientException) when calling the StartDocumentTextDetection operation: The security token included in the request is invalid. The only things I changed are the s3BucketName and Name parameters. Could you please give some guidance? Do I need to configure something in the AWS console? I ran "aws configure" to provide the AccessKeyID and the SecretAccessKey. Thanks.
@cloudquicklabs 9 months ago
Thank you for watching my videos. Please check three things here. 1. Check that you provided the correct access key and secret when you ran 'aws configure'. 2. The credentials should have the required access. 3. Check that you are using the correct client variables in the code to invoke the Textract APIs.
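
One quick way to check the first point from Python is to ask STS which identity the configured credentials belong to. This is a small sketch with a made-up helper name; the region in the usage comment is only an example.

```python
def caller_arn(sts_client):
    """Return the ARN of the identity behind the current credentials.

    With invalid credentials the underlying call raises an error instead.
    """
    return sts_client.get_caller_identity()["Arn"]


# Usage (needs AWS credentials; the region below is only an example):
#   import boto3
#   print(caller_arn(boto3.client("sts")))
#   textract = boto3.client("textract", region_name="us-east-1")
```

If this call fails as well, the problem lies with the credentials themselves rather than with the Textract code.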
@qzwwzt 9 months ago
Yes, I am, but I'll double-check it. Thanks.
@krishradha5709 11 months ago
Does it extract tables in table format?
@cloudquicklabs 11 months ago
Thank you for watching my videos. Indeed, it does extract them.
@krishradha5709 11 months ago
@cloudquicklabs Thanks a lot for the reply. Likewise, do you know of any tool that can extract flowchart content and turn it into text?