NEW GPT-4o Vision API: Best Way to Copy Text from Image (OCR in Python)

Подписаться 3,2 тыс.

Просмотров 7 тыс.

50% 1

OpenAI has released a new model, GPT-4o with Vision capabilities built right into its API. It is advertised as more accurate, faster and half the cost of the vision capabilities in the previous model. In this video we put that to the test, and try out using a python script to extract text off invoices (even handwritten ones). Also, I will show some tricks to get consistent output from the API for different types of images.
GitHub Link to Starter code:
github.com/AI-...

Опубликовано:

9 сен 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 26

@pjgilcunha 3 месяца назад

Totally agree with the Llama comment at the end: every company is going to want to build their own model (trained on basic open source libraries data and with their own data on top of it). I still struggle with understanding how that new world will look like... A bunch of "Jarvis" everywhere? Can you make a video of what you think interacting with that new internet ecosystem may look like? Thanks!

@aiunleashed509 3 месяца назад

I've been involved in a big project recently, where it was proposed to use GPT4 (vision and other capabilities) in Azure, and the legal dept immediately rejected because it would mean exposing customer data to AI. However, that data was already in other Azure data storage products...so basically they trust Microsoft with their most sensitive data up to the point AI is involved. Which doesn't make a lot of sense....but I think because it's so new a lot of CYA going on. Nobody wants to be in a position "we built this awesome AI app, but then it hallucinated all our data to hackers". I think that will evolve over the next few years. In the short-term llama will get a lot of interest, especially for proof of concepts in enterprise sandboxes. Thanks for the video suggestion, will have to think hard about that one (BIG topic, that I think most people will get wrong tbh)

@pjgilcunha 3 месяца назад

@@aiunleashed509 I guess we struggle with the same: no one can see the future. Sorry for being just the latest to ask you this... 😂 Keep up with the videos. I like the fact you teach useful stuff in a fast and objective way.

@micbab-vg2mu 3 месяца назад

thank you for the video- GPT4o is my default model at the moment - but I test other LMMs as well -)

@aiunleashed509 3 месяца назад

How are you finding GPT 4o? For vision its a no brainer, but I have heard mixed reviews for other use. Personally I have found it better for audio conversations and about the same as GPT4 turbo for everything else. Thanks for watching!

@mohamedalichakroun6967 Месяц назад

Thanks MEN !!! it's so helpful, I planified to work for my own project with GPT 4 o for extraction and I'm wondering about the billing when use multiple image in single API For json extraction?

@aiunleashed509 Месяц назад

Hey! Thanks. the way it works is you buy prepaid credits from OpenAI. Then you are charged by the tokens required to read the image and return the JSON response. The biggest variable is the resolution of the image. On this page OpenAI provides a calculator: openai.com/api/pricing/

@ArabmilitaryNews Месяц назад

very helpful, thanks!

@chrisder1814 2 дня назад

Hello, can this remove the background from the images that are in my database?

@aiunleashed509 2 дня назад

No it doesn't have that capability yet. There are API offerings like remove.bg that do it but cost is a bit much in my opinion. I would try to build something like this: www.reddit.com/r/AZURE/comments/qsaa3x/creating_an_image_background_removal_api_removebg/

@alissonpina5339 27 дней назад

Good job! 👏I have a question: if I want to send, for example, 4 images, what would be the correct way to do it?

@aiunleashed509 22 дня назад

you can pass multiple images at once, see example: community.openai.com/t/gpt4-v-the-order-of-multiple-image-inputs/519966 the problem is you can't tell which response is for which image (they are not indexed the same way they are submitted). Not sure there is a solution to this yet (let me know if you find one). The workaround is just looping through the images and getting responses one at a time.

@rajmandaviya858 2 месяца назад

Hey, I was doing the same thing before i found your video but adding response_format was a great help. Thanks! Now i am finding a way where i send multiple images to gpt4o and get an indicator if image is rotated(it does not work on rotated images) now when it comes to multiple images i need an identifier of them to rotate required image only, Do you have anything in mind?

@aiunleashed509 2 месяца назад

Hey! Glad you found it helpful. For rotating an image, I would actually just put in a pre-processor step to do this without gpt4o. Try the auto-rotation and deskew functionality in something like tesseract. I just say that because its already a solved problem, and you could do it for free without AI, and I'm not sure the models would be that reliable

@devamsanghavi1787 2 месяца назад

Thank you so much! I still encounter some issues like I'm uploading an Invoice and every time it gives different vendor name(upper case, lower case) and how to mention date format in JSON Schema? it always return different format. how can I prompt this?

@aiunleashed509 2 месяца назад

For the date format, you can prompt for it, I have had good luck. For the json schema, when the "format" is set to "date" to specify that the string should be in the date format according to ISO 8601 (YYYY-MM-DD). I think you can also specify pattern for different format. I haven't seen the lower case upper case issue. Only idea would be make it so each word had capital letter in python in post processing (for consistency), although names that are all uppercase would be a problem...

@xmagcx1 3 месяца назад

I tried to use the vision functionality, but unfortunately sometimes it invents the numbers and even if I force it in the prompt it doesn't do it :(

@aiunleashed509 3 месяца назад

Hi! can you give me an example, I will try it out

@danielalbano 3 месяца назад

Hi. Do you know the limit of tokens i can use? Im trying to transcribe an image with a lot of text, but it it stops in the middle. it seems the maximum of tokens i can use is around 1000.. How can i set more tokens per request?

@aiunleashed509 3 месяца назад

at 2:30 in the video where I set max_tokens, the limit for GPT-4o is 128,000. try a few thousand first to see if it transcribes more of the text

@JAYJang-me2zh 3 месяца назад

Hello! Thank you for the insightful video. I am currently working on a side project using GPT-4 to extract handwritten text from paper. However, since handwritten text varies greatly and some handwriting can be very difficult to read, there are occasional extraction errors that could affect the product's credibility. I am considering implementing a method where (1) the confidence level of each extraction is assessed, and (2) if the confidence level falls below a certain threshold, (3) the result is marked as N/A or skipped. However, I am in first step using GPT to make product, as I am a product manager, not a developer. Do you have any advice on how to handle this issue? Thank you once again for your helpful video. Best regards, From Korea

@aiunleashed509 3 месяца назад

Hi thanks for watching! In the past I have done projects with OCR where the output data file would give a confidence score on every word, and every character in every word. I think this might be overkill now with AI vision. I think you are on the right track with your process. I would add that if its possible to create validation functions for any of the data (such as checking in a know data value matches a pattern or existing data in a DB) this can help a lot in determining if the OCR is correct. feel free to email for more advice

@E79ric 2 месяца назад

Hi, Thanks for this video on using the GPT-4o Vision API. I'm using the code shown to detect text in images, and it's working very well. However, when I request the pixel coordinates for sections of the invoice (general information, product details, and payments), the accuracy is not very good. Could you provide some advice or demonstrate how to improve the accuracy of the pixel coordinates for each section in the image? I need to locate specific areas like the invoice number, client information, tax ID (CIF or NIF), product details, and payment information such as the total amount and VAT. Thanks in advance for any help!

@aiunleashed509 2 месяца назад

Hey! I think current general purpose AI models with vision will struggle with this kind of zonal OCR. Have you tried Google Vision? it has more established history and might work with giving coordinates. I will check it out for you though, see what my results are.

@E79ric 2 месяца назад

@@aiunleashed509 Perfect!

@E79ric 2 месяца назад

@@aiunleashed509 Thanks for the suggestion. I really appreciate your willingness to check it out for me and look forward to seeing the results you get.