PDF Amazon Textract with Python

This video shows how to extract raw text from a PDF file using Amazon Textract with Python (the boto3 module). It explains the overall flow, walks through the code line by line, and demonstrates running that code to extract the data.
The code used in the video is available in this Git repo: github.com/RekhuGopal/PythonH...
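
For readers without access to the repo, the flow described above can be sketched roughly as follows. This is a minimal sketch and not the code from the video: the function name `extract_lines`, its parameters, and the bucket and file names in the usage note are hypothetical. It assumes a boto3 Textract client is passed in and uses the `start_document_text_detection` and `get_document_text_detection` calls.

```python
import time


def extract_lines(textract_client, bucket, document_name, poll_seconds=5, sleep=time.sleep):
    """Start an async text-detection job, wait for it, and return all LINE texts."""
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document_name}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        response = textract_client.get_document_text_detection(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    # This sketch treats anything other than SUCCEEDED as a failure.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {response['JobStatus']}")

    # Results are paginated: keep following NextToken until it is absent.
    lines = []
    while True:
        lines.extend(b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE")
        next_token = response.get("NextToken")
        if not next_token:
            break
        response = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )
    return lines
```

With real credentials this would be called as `extract_lines(boto3.client("textract"), "my-bucket", "sample.pdf")`, where the bucket and file names are placeholders.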

Published: 7 Aug 2021

Comments: 34

@ced4298 2 years ago

Great explanation of the code and details of the process!

@cloudquicklabs 2 years ago

Thank you for watching my videos. Keep watching, and happy learning!

@rupadevibaskar4652 2 years ago

I really appreciate your work. Helpful one! How can I save the output in text file format, and how do I call the APIs to get the keys.csv file? Thanks.

@cloudquicklabs 2 years ago

Thanks for watching my videos. FYI:

    ## Read the keys (column names) of a CSV file
    import csv

    with open('test.csv') as f:
        reader = csv.DictReader(f)
        print(reader.fieldnames)

    ## Write lines to a text file
    lines = ['Readme', 'How to write text files in Python']
    with open('readme.txt', 'w') as f:
        for line in lines:
            f.write(line)
            f.write('\n')

@maniraj3237 1 year ago

Great explanation. What if I want to limit the number of pages sent to AWS Textract? Where exactly would I need to specify that? I believe AWS charges based on the number of pages. Thanks.

@cloudquicklabs 1 year ago

Thank you for watching my videos. The 'start_document_text_detection' API has no parameter for choosing which pages to extract, so your requirement would have to be met by filtering the output in code. As far as I know, Textract bills per page processed rather than per API call, so filtering the output does not reduce the cost; please check the pricing documentation for details.
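
Filtering the output by page could look something like the sketch below. It assumes the blocks returned by Textract have already been collected into a list of dicts, and that each block carries a `Page` number; the function name `lines_on_pages` is hypothetical, and treating a missing `Page` as page 1 is an assumption of this sketch.

```python
def lines_on_pages(blocks, wanted_pages):
    """Return the text of LINE blocks whose page number is in wanted_pages."""
    wanted = set(wanted_pages)
    return [
        block["Text"]
        for block in blocks
        # Assumption: a block without a "Page" field belongs to page 1.
        if block["BlockType"] == "LINE" and block.get("Page", 1) in wanted
    ]
```

For example, `lines_on_pages(blocks, [1, 2])` keeps only the lines found on the first two pages.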

@pearljennifer9996 1 year ago

Great video!! Can this technique be used for capturing data from invoices and bills as well?

@cloudquicklabs 1 year ago

Thank you for watching my videos. Indeed, this technique can be used in applications that extract data from invoices and bills.

@Felix9204 11 months ago

Thanks for the tutorial. It has really helped me with what I need. I was wondering: on the last line (61) you print out the extracted lines of text. Do you know how I can get each line of text, save it as either a JSON or text file, and upload the resulting file to another S3 bucket?

@cloudquicklabs 11 months ago

Thank you for watching my videos. Have you checked my video here? ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-kzOBNLzpRLE.html
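
One possible way to do what the question asks is sketched below; it is not taken from the linked video. It assumes a boto3 S3 client is passed in and uses its `put_object` call. The function name `upload_lines_as_json`, the `{"lines": [...]}` layout, and the bucket and key names are all hypothetical.

```python
import json


def upload_lines_as_json(s3_client, bucket, key, lines):
    """Serialize the extracted lines to JSON and upload them to S3."""
    body = json.dumps({"lines": list(lines)}, ensure_ascii=False, indent=2)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    return body
```

With real credentials this would be called as `upload_lines_as_json(boto3.client("s3"), "my-output-bucket", "sample.json", lines)`.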

@NoWhiteGullibility 2 years ago

Please add parsing of the confidence values into the table output

@cloudquicklabs 2 years ago

Sure, I will create a new video on it. Thank you for watching my videos.
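
Until such a video exists, a rough sketch of reading confidence values is shown here. It assumes the Textract blocks are available as a list of dicts and that each LINE block carries a `Confidence` score on a 0 to 100 scale; the function name `lines_with_confidence` and the `min_confidence` parameter are hypothetical.

```python
def lines_with_confidence(blocks, min_confidence=0.0):
    """Return (text, confidence) pairs for LINE blocks at or above a threshold."""
    return [
        (block["Text"], block["Confidence"])
        for block in blocks
        if block["BlockType"] == "LINE" and block["Confidence"] >= min_confidence
    ]
```

The same two fields could be written out as extra columns wherever the extracted text is tabulated.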

@RalphDratman 1 year ago

Thank you!

@cloudquicklabs 1 year ago

Thank you for watching my videos. Glad that it helped you.

@ahmedsaif4541 1 year ago

Hi Sir, thank you for the great explanation. What if I want to pass multiple files in documentName as a parameter for detection, rather than passing one file?

@cloudquicklabs 1 year ago

Thank you for watching my video. Please go through another video of mine, ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-kzOBNLzpRLE.html, which suits your requirements.

@cloudquicklabs 1 year ago

If this is not enough, I could suggest another method as well.

@abdelazizkamomegna471 2 years ago

Very well explained, thank you. I would like to know how to get the output in JSON format. Thanks.

@cloudquicklabs 2 years ago

Thank you for watching my videos. Apologies for the delayed response. The API response is already JSON; we are only taking the actual text values out of it.
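
Since boto3 hands the response back as a Python dict, keeping the full JSON is mostly a matter of writing that dict to disk. A minimal sketch follows; the function name `save_response` and the file path are hypothetical, and `default=str` is only a guard in case the dict contains values that `json` cannot serialize directly.

```python
import json


def save_response(response, path):
    """Write a Textract response dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(response, f, indent=2, default=str)
```

For example, `save_response(response, "textract_output.json")` stores every block, including geometry and confidence, rather than just the text.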

@MrCrismath 2 years ago

Excellent video! How do I automate the process so that it reads multiple PDF files?

@cloudquicklabs 2 years ago

Thank you for watching my videos. When there are multiple .pdf files, we need to iterate over them in code and extract the text from each one. I shall create a new video on it. One question: are the file names in the S3 bucket the same each time, or a different list?
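
Iterating over the files could be sketched as below. This is an illustration only, assuming boto3 S3 and Textract clients are passed in: it lists the bucket with the `list_objects_v2` paginator and starts one `start_document_text_detection` job per `.pdf` key. The function name `start_jobs_for_pdfs` and its parameters are hypothetical.

```python
def start_jobs_for_pdfs(s3_client, textract_client, bucket, prefix=""):
    """Start one text-detection job per .pdf object and return {key: job_id}."""
    job_ids = {}
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # An empty listing page has no "Contents" key.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".pdf"):
                continue
            job = textract_client.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            job_ids[key] = job["JobId"]
    return job_ids
```

Each returned job ID can then be polled for results separately, so the documents are processed by Textract in parallel rather than one after another.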

@MrCrismath 2 years ago

@cloudquicklabs Thank you! Another question: is there a way to reduce the processing time? Extracting the text of a single-page PDF takes approximately 7 minutes!

@cloudquicklabs 2 years ago

Thank you again for coming back on this. Processing speed can be increased by batching the jobs. I would encourage you to look at another video, which covers batching: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-uf8heaG0IhU.html

@shashikumarega3401 2 years ago

Hi, can you please tell me how to convert DOC or DOCX to PDF in AWS Lambda with Python?

@cloudquicklabs 2 years ago

Thank you for watching my videos. I am not sure if you have tried the option below:

    # Python 3 program to convert docx to pdf using the docx2pdf module
    from docx2pdf import convert

    # Convert a docx file in the same folder as the Python file
    convert("GFG.docx")

    # Convert a docx file, specifying both the input and output paths.
    # Note that the output filename need not be the same as the docx.
    convert("GeeksForGeeks/GFG_1.docx", "Other_Folder/Mine.pdf")

    # Bulk conversion of every docx file in a folder
    convert("GeeksForGeeks/")

@shashikumarega3401 2 years ago

@cloudquicklabs Thanks, but it does not work on a Linux system.
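
docx2pdf drives Microsoft Word, which is why it fails on Linux. One alternative that is sometimes used there is LibreOffice in headless mode, sketched below. This is an assumption-laden illustration and was not suggested in the thread: it assumes LibreOffice is installed and its `soffice` binary is on the PATH (inside Lambda it would have to be packaged separately), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(docx_path, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Keeping the command builder separate from the runner makes it easy to inspect or log the exact command before executing it.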

@CarlosAugusto-ng7qr 3 years ago

How do I show the result if my PDF is in table format?

@cloudquicklabs 3 years ago

Thank you for watching my videos. As of now, the 'start_document_text_detection' API doesn't support "form['Table']", but you can try other APIs that support it, for example as explained in this video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE--SpHPW3RTx8.html
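
As a rough illustration of what such an API returns, the sketch below rebuilds tables from Textract analysis blocks. It assumes the blocks come from a document-analysis call made with the TABLES feature (for example boto3's `start_document_analysis`), where TABLE blocks point to CELL blocks and CELL blocks point to WORD blocks through CHILD relationships. The function name `table_rows` is hypothetical, and this is not the code from the linked video.

```python
def table_rows(blocks):
    """Rebuild each TABLE block as a list of rows (lists of cell strings)."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the IDs listed under CHILD relationships, if any.
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[i]["Text"] for i in child_ids(cell) if by_id[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)] for r in range(1, n_rows + 1)]
        )
    return tables
```

The result is one list per table, each holding its rows in order, which can be passed to `csv.writer` or printed directly.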

@qzwwzt 9 months ago

I'm getting this error: ClientError: An error occurred (UnrecognizedClientException) when calling the StartDocumentTextDetection operation: The security token included in the request is invalid. The only things I changed are the s3BucketName and Name parameters. Could you please give some guidance? Do I need to configure something in the AWS console? I ran "aws configure" to set the AccessKeyID and the SecretAccessKey. Thanks.

@cloudquicklabs 9 months ago

Thank you for watching my videos. Please check three things here. 1. Check that you provided the correct access key ID and secret access key when you ran 'aws configure'. 2. The user should have the required access. 3. Check that you are using the correct client variables in code to invoke the Textract APIs.
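
A quick way to test the first point is to ask AWS which identity the configured credentials belong to. The sketch below assumes a boto3 STS client is passed in and uses its `get_caller_identity` call; the function name `caller_arn` is hypothetical.

```python
def caller_arn(sts_client):
    """Return the ARN of the identity behind the current credentials."""
    identity = sts_client.get_caller_identity()
    return identity["Arn"]


# Usage with real credentials (requires boto3):
#   import boto3
#   print(caller_arn(boto3.client("sts")))
```

If this call also fails with an invalid-token error, the problem is in the credentials themselves rather than in the Textract code or its permissions.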

@qzwwzt 9 months ago

Yes, I am, but I'll double-check it. Thanks.