Training Tesseract 5 for a New Font 

Gabriel Garcia
698 subscribers
41K views

Build Tesseract from source video:
• Building Tesseract 5 f...
GitHub repository link:
github.com/astutejoe/tesserac...
Training command:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
Correction: I believe the box file contains the bounding box (OBB) coordinates of the character within the image
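
For readers who want to inspect the box files mentioned in the correction above, here is a minimal sketch of a parser. It assumes the common Tesseract box layout of `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image; the `Box` type, the `parse_box_line` helper and the sample line are illustrative and do not come from the video.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of a Tesseract .box file.

    The line is split from the right so that a symbol which is itself
    a space (or several characters long) survives intact.
    """
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


# Hypothetical sample line, not taken from the video's data.
box = parse_box_line("A 36 92 55 117 0")
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

The width and height follow directly from the corner coordinates, which makes it easy to spot boxes with zero or negative size.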

Science

Published: 25 Sep 2022

Comments: 174
@donjuanpond1 18 days ago
Thank you so much, man. I've been looking everywhere for a Tesseract tutorial, and everything just points to the shitty, unreadable docs. Without you I don't know where I'd be.
@taylorbarnes6151 1 year ago
God, I love you. I just recently started messing with OCR, specifically Tesseract, and after a few hours of reading through the documentation on the steps I just wanted to end my life, hahahaha. Thank you for this, it is extremely encouraging. I can't wait to try it!
@AchievementHuntGuru 17 days ago
This training video is the only source I have found that actually gets you results if you follow it. Many thanks for this video!
@fivalt126 3 months ago
I was racking my brain trying to understand the official tutorial, and you explain it in a simple way. I'm your subscriber number 666. Thank you very much.
@wojd_ 1 year ago
Great tutorial. Using WSL I was constantly getting new errors. Switching to an OS installed on VirtualBox solved it. I was able to train on my dataset; it's surprisingly easy.
@heetshah9394 9 months ago
Could you help me with the directory structure? I am a bit confused about how it is set up.
@45545videos 1 year ago
Haven't watched the video yet, but if this works, you'll have my eternal gratitude.
@buny0n 5 months ago
Tesseract's documentation is abysmal.
@nikolaikrot8516 4 months ago
I tend to think of the Tesseract documentation as the Augean Stables.
@ganeshrajv130 1 year ago
If I have line-wise handwritten images for any language, with bounding boxes and the corresponding words, can I train this LSTM network on them? Will it work? And could you share your thoughts on the backbone of the LSTM architecture, with a flow diagram showing how fonts help with the training data?
@yichenyao5927 4 months ago
I think the reason the word error rate is high is that the font doesn't distinguish uppercase from lowercase (it's all uppercase), but the ground-truth labels do distinguish between the two.
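
One way to check this hypothesis is to compare the error rate with and without case folding. The sketch below uses a plain Levenshtein distance; the sample strings are made up, and this is not how Tesseract itself computes BCER or BWER.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(truth: str, predicted: str) -> float:
    return edit_distance(truth, predicted) / max(len(truth), 1)


# Hypothetical example: mixed-case label, all-caps recognition result.
truth = "Apex Legends"
predicted = "APEX LEGENDS"
case_sensitive = char_error_rate(truth, predicted)
case_folded = char_error_rate(truth.upper(), predicted.upper())
print(case_sensitive, case_folded)
```

If the case-folded rate is far lower, uppercasing the ground-truth text before training may be worth trying for an all-caps font.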
@3ombieautopilot 1 year ago
Thank you for making this video. But I can't wrap my head around where to put all those data files. I'm trying to fine-tune on variations of letters with accents, and I'm helpless.
@madhavpandey30 1 year ago
Hey Gabriel, I am following your steps to train my model on handwritten text, but it always fails with this error:
unicharset_extractor --output_unicharset "data/Apex/my.unicharset" --norm_mode 2 "data/Apex/all-gt"
Failed to read data from: data/Apex/all-gt
Wrote unicharset file data/Apex/my.unicharset
Can you please help me here? I am stuck. Thanks!
@gyeongwango5434 8 months ago
I want to train Tesseract with an image file I have (consisting of several lines of text), but I'm not sure how to go about it, starting with creating the training data. I'd really appreciate your tips (URLs for reference, etc.).
@ConfusedProgrammer 6 months ago
I've been experimenting with this tutorial for three days. The file structure in the video and the GitHub repo don't quite match; can you please update the repo if possible? I am running into too many folder inconsistencies when trying to connect the dots, as that part was brushed over really quickly. Thank you :)
@user-wi7pn5mw1c 5 months ago
Thank you for doing this tutorial. Can I use the text2image approach to generate box files and tif files to train a new font for Tesseract 4.0?
@Leo-hk7kk 9 months ago
I want to custom-train Tesseract 5 to read car license plates that are detected using a YOLO model. How can I do this, given that I have a couple of thousand images? What steps do I need to follow?
@ganeshrajv130 1 year ago
I tried this with a font for the Hindi language (Kruti Dev 010), and even tried Kruti Dev 016, but it shows: Error: Call PrepareToWrite before WriteTesseractBoxFile!!
@ganeshrajv130 1 year ago
The title says it is for a new font. Can I also use it for a new language, using TIFF files?
@azadehpedram7215 5 months ago
I have a bunch of plates with some text on them, and the goal is to convert the images to text. I trained a special font but it was not effective. How can I train to get better results? Thanks for the help.
@DalvinderKaur-iz5sn 1 year ago
The .lstmf files are missing. Please help me find where I am going wrong.
@wonkduck4759 9 months ago
Hi Gabriel! Thank you so much for the video. One question: where did you put your Apex Legends ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.
@rcraftg4mer42 7 months ago
Did you find any answers?
@ganeshrajv130 1 year ago
I tried with your font but it didn't work. It throws: Could not find font named 'Arial Unicode MS Regular'. Pango suggested font 'Liberation Mono'. I tried with Arial but that didn't work either.
@YashhBhushan 28 days ago
Buddy, I need help. I need to learn this software but I'm absolutely clueless. Are there any sources, tutorials or videos I can watch?
@PratibhaVaradkar 1 year ago
Hi Gabriel (@AstuteJoe), thank you for the elaborate tutorial. I have a question though. I followed the tutorial and generated the .tif, .gt.txt and .box files manually. My training quits with a zero error rate before reaching the max iterations. But when I use the generated traineddata file, it gives the error "Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/5/tessdata/lang_name.traineddata!! Failed loading language 'lang_name' Tesseract couldn't load any languages! Could not initialize tesseract." Can you please suggest what I missed?
@umandadikwatta178 1 year ago
Thank you very much for this. One question: can we train Tesseract with non-Unicode fonts using the same process?
@AstuteJoe 1 year ago
I'm pretty sure you can, as long as text2image works correctly. If text2image doesn't work correctly you can either come up with another clever way (like Python scripts) of automatically generating the ground truth data (.gt.txt, .box and .tif files) or, worst case, create them manually.
@ganeshrajv130 1 year ago
Can we train Tesseract without any font? If not, why can't we?
@eusebiosouza2252 9 months ago
Great video! I'm getting this error when I try to run the training command: "Failed to read boxes from data/FE_Font-ground-truth/eng_16.tif". The file eng_16.tif does not seem to be empty and it's very similar to all the other training files. I'm running with MAX_ITERATIONS=100, and if I delete the file that seems to be the problem, Tesseract throws the same error but for a different file. Could anyone please help me?
@DalvinderKaur-iz5sn 1 year ago
When Tesseract training starts, it shows the warning below for Sindhi: Can't encode transcription: 'पिए वई। ज़ख़मनि जो सूर वधंदो वियो हू चीखन्दो. How can I handle this problem?
@AmphibianDev 1 year ago
Hi, I am having issues with the last make training command. It throws the error "No module named 'PIL'". I have the Pillow library installed but the error is still there. I have been trying to solve this issue for a long, long time. If you know something, I would appreciate the help. I wanted to link to my GitHub issue but I am afraid YouTube doesn't allow links.
@mohammadmn7364 6 months ago
Hey, a long time has passed, but for others having the same issue: creating a virtual env and then installing requirements.txt (from the tesstrain repo) in it may fix the issue; at least it worked for me! Also check whether all txt files have matching box files.
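
The file check suggested in the reply above can be automated. This is a minimal sketch, assuming the naming used elsewhere in this thread (`eng_N.gt.txt`, `eng_N.tif`, `eng_N.box` in one ground-truth folder); the helper name `find_incomplete_samples` and the demo file names are illustrative.

```python
import tempfile
from pathlib import Path


def find_incomplete_samples(ground_truth_dir, extensions=(".tif", ".box")):
    """Map each sample base name to the companion files it is missing."""
    missing = {}
    for gt_file in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        base = gt_file.name[: -len(".gt.txt")]
        absent = [ext for ext in extensions
                  if not (gt_file.parent / (base + ext)).is_file()]
        if absent:
            missing[base] = absent
    return missing


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("eng_0.gt.txt", "eng_0.tif", "eng_0.box",
                 "eng_1.gt.txt", "eng_1.tif"):
        (root / name).write_text("", encoding="utf-8")
    report = find_incomplete_samples(root)

print(report)
```

Pointing the function at a folder such as `data/Apex-ground-truth` would list every sample whose image or box file is absent, which may narrow down "Failed to read boxes" errors.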
@ivanmongebadilla9454 1 year ago
Thanks for the tutorial, Gabriel. I wanted to ask how I could do this process if I already have images of the text. I guess I need to create the .txt file and the .box file and then just run the training command. Do you know any software that I could use to create the .box file from the images I have? Thanks in advance!
@AstuteJoe 1 year ago
I have seen people use jTessBoxEditor: vietocr.sourceforge.net/training.html
@ivanmongebadilla9454 1 year ago
@AstuteJoe One more question: how would you use the newly trained model in Python? Thank you.
@AstuteJoe 1 year ago
@ivanmongebadilla9454 I think it's just a parameter, lang='your_new_model_name', as long as the new model is in the tessdata folder.
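
A minimal sketch of that suggestion using pytesseract. It assumes `pytesseract` and `Pillow` are installed and that the trained file (for example `Apex.traineddata`) sits in the tessdata folder; the helper names are illustrative, and `--oem 1` is added on the assumption that a model produced this way carries only LSTM components.

```python
def build_config(tessdata_dir=None, oem=1):
    """Assemble the extra command-line options handed to Tesseract."""
    parts = [f"--oem {oem}"]  # 1 selects the LSTM engine only
    if tessdata_dir is not None:
        # Quoted so that a path containing spaces stays one argument.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


def ocr_with_custom_model(image_path, model_name, tessdata_dir=None):
    """Run OCR on one image with a custom traineddata file."""
    # Imported here so the helper above works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=model_name, config=build_config(tessdata_dir))


print(build_config("../tesseract/tessdata"))
```

A call such as `ocr_with_custom_model("sample.png", "Apex", "../tesseract/tessdata")` would then use the new model; `sample.png` is a placeholder file name.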
@heetshah9394 9 months ago
Is it necessary for the box file to have one box per character, or is it okay to have one word per bounding box?
@Bobo-wl6bs 1 year ago
Hi Gabriel. I came across Tesseract today. I'm curious whether I will be able to train it to learn an Arabic font. I have a bunch of PDFs which are written in an indigenous language. The idea is to train it on some sample pages so that it will be able to read them. The text includes diacritics, so I'm not sure if it will work.
@AstuteJoe 1 year ago
Check the comments, a bunch of people train it for this exact purpose.
@ganeshrajv130 1 year ago
One last question: I guess Tesseract is basically not trained on handwritten text. It is trained on lines of system text, which are converted to images line by line for training. Is my assumption true?
@dhirazz 1 year ago
Hey, it seems like you were also looking to train Tesseract with handwritten text. Did you do it? If so, please shed some light, I am so lost.
@ganeshrajv130 1 year ago
@dhirazz Training is not an easy thing, as you need a huge amount of data, and they (Google) have also clearly said that training is not going to make much sense. Hence, if you want to try adjusting the parameters, then deep dive into the C++ code.
@listentomusicfeellikehome 2 months ago
Hi. I tried this on Colab. I installed Tesseract and went on to run split_training_text.py, and got this error: FileNotFoundError: [Errno 2] No such file or directory: 'text2image'. Is there a solution?
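
That `FileNotFoundError` means Python could not find a `text2image` executable to launch; one possible cause is that the Tesseract training tools are not installed or not on `PATH`. A small hedged check, where the tool list is an assumption about what the workflow calls:

```python
import shutil

# Assumed set of command-line tools used somewhere in the workflow.
REQUIRED_TOOLS = ("text2image", "tesseract", "lstmtraining",
                  "unicharset_extractor", "combine_tessdata")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Missing tools:", missing_tools())
```

Running this before `split_training_text.py` shows up front which executables still need to be installed or added to `PATH`.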
@hoangcuong9521 5 months ago
Thank you for making this video. It helps me a lot. But I have a problem: when I copy the script and replace the paths to the save dir or language_code..training_text, all of the generated images come out as blank white images. Please help me out with this.
@aayushjain7793 1 year ago
While running the script 'split_training_text.py', I am getting the following error: Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored. Could you help me resolve this?
@jayrigger7508 1 year ago
I am also getting this. Running as sudo helped a bit, but I am still getting this: "Unable to open '../tmp/fonts.conf' for writing: No such file or directory"
@jayrigger7508 1 year ago
Just to add: I am getting eng_XX.box, eng_XX.tiff and eng_XX.gt.txt.
@aayushjain7793 1 year ago
@jayrigger7508 I have resolved the issue by just changing the --font flag to /usr/share/fonts
@NotFlashYT 1 year ago
How do you get suggestions in your terminal for auto-completion of commands?
@AstuteJoe 1 year ago
fishshell.com/
@shadyas.1571 11 months ago
Hi Gabriel. Thank you for this tutorial. I was trying to run the code but I'm receiving this error: Fontconfig error: Cannot load default config file: No such file: (null). This error appears to be font-related. I've experimented with several fonts but I'm unable to resolve this issue. Could you help me please?
@kavachek2 9 months ago
Same problem.
@pauliusliaudenskas9269 6 months ago
Have you been able to figure it out? I'm having the same problem.
@kavachek2 6 months ago
@pauliusliaudenskas9269 Unfortunately, I couldn't. I don't understand how to do it.
@DalvinderKaur-iz5sn 1 year ago
When I run the training command, it gives me the error below:
Segmentation fault (core dumped) tesseract "data/Apex-ground-truth/eng_62.tif" data/Apex-ground-truth/eng_62 --psm 13 lstm.train
Makefile:262: recipe for target 'data/Apex-ground-truth/eng_62.lstmf' failed
make: *** [data/Apex-ground-truth/eng_62.lstmf] Error 139
Can you help me fix this?
@xzerozdead 1 year ago
Your folder was probably named "Apex" and not "Apex-ground-truth".
@snoopi6243 1 year ago
Is there any way to fine-tune RTL languages/fonts on Windows just like this?
@physicfor 12 days ago
On Windows text2image will never find the font name, so it is better to install a Linux virtual machine.
@mukilanru 20 days ago
I want to be able to OCR '±', which is being detected as '+'. tesseract 5.4.0.20240606, pytesseract 0.3.10, Python 3.12
@KINGERTADC_yay 1 year ago
Hey Gabriel, nice vid. I am actually using it to train Tesseract on the Aurebesh font/language from Star Wars (look it up, it would explain a lot); each letter has a corresponding English letter. I have collected roughly 100,000 sentences using your program and trained with the command you provided, but when I run it on a 6-letter word it completely melts down and just outputs the wrong answer. I have tried both small and large iteration counts but had no luck. I am wondering if you can help me or point me in the right direction. Thanks a lot.
@ganeshrajv130 1 year ago
Hey, you collected the font, but what is the training text data? Is it in Aurebesh?
@kinderpinguiin7064 1 year ago
Hi! I don't know if you have resolved your issue in the past month, but don't forget to set a large MAX_ITERATIONS for make training. I personally set it to 10000 and it was quite a bit better; it might well be enough for you if you have 100,000 sentences. If you want to know how it is going, check the log while the model is training, for example: At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, BCER train=97.817000%, BWER train=100.000000%, skip ratio=0.000000%, New best BCER = 97.817000 wrote checkpoint. BCER is the error rate for characters and BWER the error rate for words. You can see that at iteration 7800 it was still higher than 95%, and after the 9500th iteration I got several improvements.
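
To follow BCER and BWER over a long run, the log lines can be parsed rather than read by eye. This sketch assumes the log format quoted in the comment above and captures only the first of the three iteration counters; the pattern and function names are illustrative.

```python
import re

# Pattern written against the log line quoted above; other Tesseract
# versions may format the line differently.
LOG_PATTERN = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+,.*?"
    r"BCER train=(?P<bcer>[\d.]+)%,\s*BWER train=(?P<bwer>[\d.]+)%"
)


def parse_progress(line):
    """Return iteration, BCER and BWER from one log line, or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match["iteration"]),
        "bcer": float(match["bcer"]),
        "bwer": float(match["bwer"]),
    }


sample = ("At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, "
          "BCER train=97.817000%, BWER train=100.000000%, skip ratio=0.000000%, "
          "New best BCER = 97.817000 wrote checkpoint.")
progress = parse_progress(sample)
print(progress)
```

Feeding every line of the training output through `parse_progress` gives a series that can be plotted to see whether the error rate is still falling.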
@akshatjain2925 6 months ago
Hi, you say that using text2image involves no AI, but text2image must itself be some kind of model too, right?
@rabbitpiet7182 5 days ago
This comment isn't AI.
@rabbitpiet7182 5 days ago
I mean it's not rendered with AI.
@insidethoughts502 1 year ago
Can Tesseract 5 help with detecting only bold text in images?
@AstuteJoe 1 year ago
Only experimentation will tell, but Tesseract 5 does perform better sometimes.
@sebastianorzechowski4613 4 months ago
Hello, has anyone tried to teach Tesseract Polish characters? I have adjusted split_training_text for Tesseract 5.0 to create lines from a Polish text set and then train Tesseract. I think the problem is with the font type, because it should know how to render those special characters: Stripped 4 unrenderable word(s): 'unieważnienie SZKOŁAMI NADZIEJĘ, | '. I can share my adjusted script for generating those lines if you want. I will try another font. I tried HvDTrial Fabrikat Mono.
@monctrikblitz5674 4 months ago
When running your Python script, an error occurs: Fontconfig error: Cannot load default config file Fontconfig error: Cannot load default config file Could not find font named 'Waukegan LDO Bold'. Please correct --font arg. How can I solve this error? I need to use my unique font "Waukegan LDO Bold.ttf". I hope you can help me solve this problem, thank you in advance.
@sebastianorzechowski4613 4 months ago
I think that you should install this font on your system first :)
@umandadikwatta178 1 year ago
Hello, can you please explain how to debug the Tesseract code, to get an idea of how the code works?
@AstuteJoe 1 year ago
Honestly, I think your best bet is cloning the GitHub repo, reading the docs and then delving into the code and just reading it. Eventually you'll get better at knowing where to look, and after trying hard you might get comfortable and understand it. I'm pretty sure the docs describe how to dump and inspect debug files for some intermediate steps. Finally, be sure to run it in verbose mode, probably -v. Ah, and you can compile it with debugging symbols too, which should help if you want to set breakpoints etc.
@IshaqKhan010 1 year ago
Brother, could you train it for the Urdu Nastaliq font? There is no accurate trained data on the net, please.
@nilor7550 1 year ago
I didn't understand how to run the training command after downloading the two folders from GitHub. I have a Windows system.
@physicfor 12 days ago
It will never work on Windows.
@legendevent3911 1 year ago
Hey Gabriel, I have a training_text file with just digits like 1,234,567 in a variety of combinations. The problem is that when I try to start your script I get the following error message:
python3 split_training_text.py
Traceback (most recent call last):
File "split_training_text.py", line 12, in <module>
for line in input_file.readlines():
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Could you help me resolve this? I'm a newbie in Python. The tutorial was great!
Edit: When I change the script to: with open(training_text_file, 'rb') I get a new error: TypeError: write() argument must be str, not bytes
@AstuteJoe 1 year ago
Can you send me the whole file? Pastebin or GitHub will do. I believe I know exactly how to fix it, but I need the whole file to send you the fixed version.
@abdeldjalilchougui 1 year ago
Did you solve the problem? If yes, could you share the fix with me please?
@abdeldjalilchougui 1 year ago
@AstuteJoe Did you solve the problem? If yes, could you share the fix with me please?
@sebastianorzechowski4613 4 months ago
I think you have to pass encoding='utf-8' inside the open function: with open(training_text_file,'r',encoding='utf-8') as input_file:
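
Since the traceback already names the 'utf-8' codec, the file may simply not be UTF-8: a 0xff byte at position 0 is consistent with a UTF-16 byte order mark, which some Windows editors write. The following is a hedged sketch that picks the codec from the BOM and otherwise falls back to UTF-8; `decode_with_bom` is an illustrative name, not part of the video's script.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM begins with
# the same two bytes as the UTF-16 one.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_with_bom(raw: bytes) -> str:
    """Decode bytes using the codec implied by a leading BOM, else UTF-8."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)  # these codecs consume the BOM
    return raw.decode("utf-8")


# Simulate a digits-only training text saved as UTF-16 with a BOM.
text = decode_with_bom("1,234,567\n".encode("utf-16"))
print(repr(text))
```

In the script this would mean reading the training text with `open(path, 'rb')`, passing the bytes through `decode_with_bom`, and splitting the resulting string into lines, so that later writes still receive `str` rather than `bytes`.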
@Bengeljo 5 months ago
I always get an error when I want to use a font. It is installed, Windows can find it, and even looking it up works perfectly. When I run split_training_text.py I get the following error: Fontconfig error: Cannot load default config file: No such file: (null) Fontconfig error: Cannot load default config file: No such file: (null) Could not find font named 'Quadrant'. Pango suggested font 'Cascadia Code'. Please correct --font arg. I want to train the model on Quadrat-Serial-Regular.ttf but it just won't recognize it. I tried to look it up but can't find it. Modifying the font flag doesn't help, since it wants a name but can't find it even though it is there, and to be honest I don't know where it is searching for the fonts. The folder is located on the SSD (E:) and the operating system is on C:, but Tesseract and Python are on the PATH on C:, so they should have access to it. Please help.
@TheComputerChip 4 months ago
Having the same problem. Still trying to understand what it is looking for...
@Bengeljo 4 months ago
@TheComputerChip I gave up and looked at another method that uses Google Colab to create my own model; it works pretty well there. I don't remember which video it was, because between then and now I have probably watched about 250 videos. Not kidding, I don't have a life.
@TheComputerChip 4 months ago
@Bengeljo Hahaha, no worries. I actually ended up getting this to work. Oddly enough, the error doesn't seem to affect the output. As long as it finds the font, everything still runs. I'm currently waiting while my PC generates the images, and then I'll sleep while it trains. On video #3 since starting the image creation! lol
@ROHIT_S_Patil 1 month ago
@Bengeljo Can you share the Google Colab workflow you followed to create your model?
@DalvinderKaur-iz5sn 1 year ago
Thanks for the tutorial, sir. I get an error after running the training command: TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000. The error is: "CMakefile:325: recipe for target 'data/foo/checkpoints/foo_checkpoint' failed". I also get: Encoding of string failed! Failure bytes.... ..Can't encode transcription: ..... Can you please help me with these issues?
@DalvinderKaur-iz5sn 1 year ago
MODEL_NAME=foo
@cryptoplusone3850 1 year ago
Does this also work on Windows or do I have to use a different method?
@AstuteJoe 1 year ago
I believe it works, but definitely not every step will be exactly like in the video. As far as I remember, the Tesseract maintainers highly recommend Linux instead.
@focusofLandD 1 year ago
I tried on Windows and it is not working very well. Please let me know if you are able to solve it.
@adityanjsg99 1 year ago
So far this is the only tutorial on Tesseract 5; the old method of training via bash scripts has been abandoned since December 2022.
@faint.2396 1 year ago
So, are you saying this video is now not useful at all?
@ManuthVANN 6 months ago
Thanks so much, sir, for your clear explanation and code.
@farazsoftinfo 1 year ago
Hi Gabriel, thanks for making this tutorial, I was waiting for it. I will start training my model soon. 😍 But how can we fine-tune a model? Can you please show me how I can combine this newly trained file with another model?
@AstuteJoe 1 year ago
Glad you liked it! In this tutorial you can see that I actually fine-tuned: I started from the eng.traineddata file from Tesseract and trained it further on a new font. This should be enough for most cases.
@farazsoftinfo 1 year ago
@AstuteJoe Hi Gabriel, when I fine-tune I get a very bad result. I just want to add some new words and some characters, but the final file that I get is worse than the main traineddata file. I'm trying to fine-tune an RTL language. Thanks a lot.
@AstuteJoe 1 year ago
@farazsoftinfo That's a very different rabbit hole, that's ML technique. You might be overfitting (training too much) or underfitting (training too little) your model. Have you tried generating all the 193k PDFs to train on and leaving it to train for a bit?
@gabriel2011gabriel 1 year ago
@farazsoftinfo I'm trying to do the same thing and the result is a bunch of "mmmoooomom...". Is yours the same?
@farazsoftinfo 1 year ago
@gabriel2011gabriel I tried it for Persian, but I couldn't get a good result. The main models are still better than what I got. When I try to add some new words and fonts I get a worse model. Maybe I should look into it more to figure out the best settings for RTL languages.
@Ethiopic 11 months ago
Thank you for this video. I am now able to train Tesseract to OCR my language data on the Mac. This is working great on both Linux and the Mac. (But I am unable to do so on Windows, because I am getting the error "tessdata_prefix not recognized".)
@wonkduck4759 9 months ago
Hello, I am currently stuck. Where did you put your new font ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.
@alirezanadafy9267 8 months ago
Hi. Just run: set TESSDATA_PREFIX="../tesseract/tessdata" and then run text2image....
@PsychologicalHeat 1 year ago
I am receiving this error when I try to run your command:
Failed to read boxes from data/myFont-ground-truth/eng_45.tif
Error during processing.
make: *** [data/myFont-ground-truth/eng_45.lstmf] Error 1
The command was: TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=myFont START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
I have added eng.traineddata to tessdata. Can you help me fix it please?
@AstuteJoe 1 year ago
Did you generate the .box files successfully?
@PsychologicalHeat 1 year ago
@AstuteJoe I cleaned the box files but now I get a different error. Here is my output:
+ tesseract data/myFont-ground-truth/eng_2.tif data/myFont-ground-truth/eng_2 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_0.tif data/myFont-ground-truth/eng_0 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_5.tif data/myFont-ground-truth/eng_5 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_7.tif data/myFont-ground-truth/eng_7 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_3.tif data/myFont-ground-truth/eng_3 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_1.tif data/myFont-ground-truth/eng_1 --psm 13 lstm.train
read_params_file: Can't open lstm.train
find -L data/myFont-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/myFont/all-lstmf"
Error: missing ground truth for training
make: *** [data/myFont/list.train] Error 1
Your help will be very much appreciated 🙂
@AstuteJoe 1 year ago
@PsychologicalHeat Did you generate the .gt.txt files? Those are text files with the actual text in them.
@PsychologicalHeat 1 year ago
@AstuteJoe Yes, I have all the .gt.txt, .box and .tiff files. I think the problem may be that I want the OCR to read only uppercase letters. I have made a custom training_text file and it only has numbers, '-' and uppercase letters. I played around with it and now this is the output:
find -L data/myFont-ground-truth -name '*.gt.txt' | xargs paste -s > "data/myFont/all-gt"
unicharset_extractor --output_unicharset "data/myFont/unicharset" --norm_mode 2 "data/myFont/all-gt"
Bad box coordinates in boxfile string! 36-XR-34928-PN-54460-TN-50758-XB-02919-JP-10263-DG-99350-MF-07358-PK-31144-MB-35731-ZX-758
Extracting unicharset from plain text file data/myFont/all-gt
Other case x of X is not in unicharset
Other case r of R is not in unicharset
Other case p of P is not in unicharset
Other case n of N is not in unicharset
Other case t of T is not in unicharset
Other case b of B is not in unicharset
Other case j of J is not in unicharset
Other case d of D is not in unicharset
Other case g of G is not in unicharset
Other case m of M is not in unicharset
Other case f of F is not in unicharset
Other case k of K is not in unicharset
Other case z of Z is not in unicharset
Wrote unicharset file data/myFont/unicharset
make: *** No rule to make target `data/myFont-ground-truth/myFont_1.lstmf', needed by `data/myFont/all-lstmf'. Stop.
@hugolearn 2 months ago
So I actually followed this through; handy scripts. However, it seems to have seriously overfit my data. No augmentations? No variance in font size or spacing? I notice that in this video you only evaluate your trained model against a ground-truth image. This all looks technically correct, but as it stands isn't it still kind of useless for any practical application? How is the output if you generate a new text image without text2image and run the model against that?
@AstuteJoe 2 months ago
I imagine you could edit the text2image utility source code to introduce the variance you need; Tesseract is open source.
@3ombieautopilot 1 year ago
Hello! Can you make a video about how to make Tesseract recognize a character which is not in eng.traineddata, like ± or Ó, mixed with some English text?
@adityanjsg99 1 year ago
Train it and then use it.
@asiburrahman3623 1 year ago
I didn't get the font part. Where did you put the font?
@AstuteJoe 1 year ago
It has to be installed on your system; each OS has a different way of doing it.
@asiburrahman3623 1 year ago
@AstuteJoe I'm using Ubuntu. Is there any way to specify the directory?
@AstuteJoe 1 year ago
@asiburrahman3623 askubuntu.com/questions/3697/how-do-i-install-fonts
@asiburrahman3623 1 year ago
@AstuteJoe I have installed the font but this error still shows: Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored Could not find font named 'Apex'.
@kannapatudompant8535 1 year ago
@asiburrahman3623 I had the same problem. I tried adding '--fontconfig_tmpdir={fontconf_dir}'. The default is /tmp, which doesn't have our font directory in it. fonts.conf is usually located in etc/share/fonts. Now I can create the .box and .tif files. I hope this solves your issue too.
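
As a hedged illustration of the font flags discussed in this thread: text2image accepts `--fonts_dir` (where to look for fonts) and `--fontconfig_tmpdir` (where it writes its temporary fontconfig files). The sketch below only builds the argument lists; the directory defaults and file names are assumptions, and the video's script may pass additional options.

```python
def build_text2image_command(text_file, output_base, font_name,
                             fonts_dir="/usr/share/fonts",
                             fontconfig_tmpdir="/tmp"):
    """Argument list for rendering one training line with text2image."""
    return [
        "text2image",
        f"--font={font_name}",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--fonts_dir={fonts_dir}",                  # assumed font location
        f"--fontconfig_tmpdir={fontconfig_tmpdir}",  # must be writable
        "--max_pages=1",
        "--strip_unrenderable_words",
    ]


def build_list_fonts_command(fonts_dir="/usr/share/fonts"):
    """Argument list that asks text2image which fonts it can see."""
    return ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"]


# Hypothetical file names for illustration.
command = build_text2image_command("eng_0.gt.txt", "eng_0", "Apex")
print(" ".join(command))
```

Passing such a list to `subprocess.run(command, check=True)` avoids shell quoting problems with font names that contain spaces, and running the list-fonts command first shows the exact name text2image expects for `--font`.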
@TuanLe-ve7lm 1 year ago
Hi Gabo, may I please see your fonts.conf file?
@AstuteJoe 1 year ago
I'm not even sure what this file is now, but here you go. This one is in my home folder: /home/gabri/tesseract_training/apex_legends.otf
@AstuteJoe 1 year ago
This one is in the tesseract project folder:
@TuanLe-ve7lm 1 year ago
I have made good progress today: I am able to train the Apex font. However, when I switch to another font, Noto Sans, it is able to generate the box and tif files but shows an error while training: "Makefile:219: *** found no data/Noto Sans-ground-truth/*.gt.txt for Sans/all-gt. Stop." It seems it does not accept a font name with a space in the middle.
@AstuteJoe 1 year ago
@TuanLe-ve7lm That could definitely be it; spaces and Linux (or Windows) don't mix well.
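
One possible workaround, sketched under the assumption that only the model and folder name needs to be free of spaces while the font name passed to text2image keeps its original spelling. The helper name is illustrative.

```python
import re


def safe_model_name(font_name: str) -> str:
    """Replace runs of whitespace so make does not split the name."""
    return re.sub(r"\s+", "_", font_name.strip())


font_name = "Noto Sans"                   # name as text2image expects it
model_name = safe_model_name(font_name)   # value to use for MODEL_NAME
ground_truth_dir = f"data/{model_name}-ground-truth"
print(model_name, ground_truth_dir)
```

The training command would then use `MODEL_NAME=Noto_Sans`, with the ground-truth files placed in `data/Noto_Sans-ground-truth`.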
@kallemyllynen9571 6 months ago
Running this on Windows, I had to modify the Makefile to make it work.
@ikedoriens6149 1 year ago
Jesus. Isn't there just a command-line option like in Tesseract 4.0? This seems a bit complicated for someone who's not into programming.
@blndazeez1973 1 year ago
Hi Gabriel, great video! One question: when I try to retrain the Arabic model using this command: "TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=ara TESSDATA=../tesseract/tessdata MAX_ITERATIONS=200" it gives me the error below: "Error opening data file ../tesseract/tessdata/eng.traineddata". The problem is that I am not using the English model. Thanks for the video again!
@AstuteJoe 1 year ago
That's really odd. I see you changed START_MODEL, so it should work. Not super sure now.
@AstuteJoe 1 year ago
Do you have ara.traineddata in the tessdata folder?
@blndazeez1973 1 year ago
@AstuteJoe Yes, I have it, and I made sure of it a couple of times.
@AstuteJoe 1 year ago
@blndazeez1973 Maybe it's because the Apex model had already been created when you were trying it out, and it's already built on top of the eng trained data?
@blndazeez1973 1 year ago
@AstuteJoe I redid the steps with a different model name but it gives me the same error. That is strange.
@rcraftg4mer42
@rcraftg4mer42 7 месяцев назад
i love you
@AstuteJoe
@AstuteJoe 7 месяцев назад
lol i love you too
@datarkmveri2228 1 year ago
please help
@kurobane_sama 9 days ago
It's impossible to use a language other than English :(
@saviomilbratz 22 days ago
Training Tesseract is an almost impossible task. There could be an easier way using just Python or something simpler. For a regular Windows user like me, this task is almost impossible.
@focusofLandD 1 year ago
Hi Gabriel, I am getting this error at the last training step when I try to train a new font called Bender:
Failed to read data from : data/bender/bender.worldlist
Failed to read data from : data/bender/bender.punc
Failed to read data from : data/bender/bender.numbers
Failed to read data from : data/bender/bender.config
Invalid format in radical table at line 0: 19886 3 23 6 3
@notAvn 1 year ago
Did you manage to train Tesseract for Bender yet?
@_nom_ 1 year ago
No rule to make target 'data/eng-ground-truth/eng.training_text.lstmf'
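One possible reading of this error, offered as an assumption rather than a diagnosis: the ground-truth folder contains a transcript named eng.training_text.gt.txt with no image next to it, so make has nothing to build the matching .lstmf from. A minimal sketch that lists transcripts lacking a same-named image, assuming the images are .tif or .png files; the function name `unpaired_transcripts` is hypothetical:

```python
from pathlib import Path

# Assumption: line images sit next to the transcripts as .tif or .png files.
IMAGE_SUFFIXES = (".tif", ".png")


def unpaired_transcripts(ground_truth_dir):
    """Return the names of *.gt.txt files that have no same-named image."""
    missing = []
    for gt in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        stem = gt.name[: -len(".gt.txt")]
        has_image = any((gt.parent / (stem + s)).exists() for s in IMAGE_SUFFIXES)
        if not has_image:
            missing.append(gt.name)
    return missing


# Folder name taken from the error message above.
print(unpaired_transcripts("data/eng-ground-truth"))
```

Any file it reports would need either a matching image or to be moved out of the ground-truth folder before running the training command again.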
@Kronzplayz. 1 year ago
Kindly help, I'm getting an error while training, please @AstuteJoe:
Failed to read data from: data/OCRA/OCRA.wordlist
Failed to read data from: data/OCRA/OCRA.punc
Failed to read data from: data/OCRA/OCRA.numbers
Loaded unicharset of size 112 from file data/OCRA/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/OCRA/OCRA.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/OCRA/OCRA.traineddata] Error 1
@Kronzplayz. 1 year ago
I solved this issue 😅
@enriqueortiz5875 1 year ago
@Kronzplayz. How did you solve it? I got the same issue.
@user-of2lm9ii5g 1 year ago
@enriqueortiz5875 Solved it: you need to run these in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
@datarkmveri2228 1 year ago
Hi, when I try to run the training command it gives an error. Can you please help me?
Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1
@datarkmveri2228 1 year ago
Command:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Apex
@datarkmveri2228 1 year ago
tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
+ tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
python3 shuffle.py 0 "data/Apex/all-lstmf"
+ head -n 90 data/Apex/all-lstmf
+ tail -n 10 data/Apex/all-lstmf
combine_lang_model \
  --input_unicharset data/Apex/unicharset \
  --script_dir data/langdata \
  --numbers data/Apex/Apex.numbers \
  --puncs data/Apex/Apex.punc \
  --words data/Apex/Apex.wordlist \
  --output_dir data \
  \
  --lang Apex
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Loaded unicharset of size 113 from file data/Apex/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Other case FI of fi is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = H
Warning: properties incomplete for index 5 = E
Warning: properties incomplete for index 6 = S
Warning: properties incomplete for index 7 = -
Warning: properties incomplete for index 8 = R
Warning: properties incomplete for index 9 = I
Warning: properties incomplete for index 10 = K
Warning: properties incomplete for index 11 = N
Warning: properties incomplete for index 12 = G
Warning: properties incomplete for index 13 = B
Warning: properties incomplete for index 14 = 8
Warning: properties incomplete for index 15 = 5
@user-of2lm9ii5g 1 year ago
@datarkmveri2228 Solved it: you need to run these in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
@user-of2lm9ii5g 1 year ago
Hello, how do I fix this?
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1
@user-of2lm9ii5g 1 year ago
Solved it: you need to run these in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
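The failing runs quoted in this thread all stop on two files under data/langdata: radical-stroke.txt and Latin.unicharset. A minimal pre-flight sketch that checks for them before starting a training run, assuming those two files are the ones that matter for a Latin-script font; the function name `missing_langdata` is hypothetical, and the file list is taken only from the error logs above:

```python
from pathlib import Path


def missing_langdata(langdata_dir="data/langdata", script="Latin"):
    """Return the expected langdata files that are not present.

    The file names come from the error messages quoted in this thread;
    other scripts or setups may need additional files (assumption).
    """
    required = ["radical-stroke.txt", f"{script}.unicharset"]
    base = Path(langdata_dir)
    return [name for name in required if not (base / name).is_file()]


# Run from the tesstrain folder; an empty list means both files were found.
print(missing_langdata())
```

If the list is not empty, the fix reported above applies: run the make targets mentioned in the tesstrain folder and check again.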
@user-yj8eh5ft9m 1 year ago
thanks
@ganeshrajv130 1 year ago
read_params_file: Can't open make
read_params_file: Can't open training
read_params_file: Can't open MODEL_NAME=nakula_hin
read_params_file: Can't open START_MODEL=hin
read_params_file: Can't open TESSDATA=/usr/local/share/tessdata/
read_params_file: Can't open MAX_ITERATIONS=10
Error, cannot read input file TESSDATA_PREFIX: No such file or directory
Error during processing.
This is the error I get even though I followed your steps.
@faint.2396 1 year ago
Hi, I'm getting this error:
Traceback (most recent call last):
  File "C:\Users\HAVASIZ\Desktop\tesseract_tutorial\split_training_text.py", line 34, in
    subprocess.run([
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2]
@TuanLe-ve7lm 1 year ago
Same for me. Have you found a solution yet?
@faint.2396 1 year ago
@TuanLe-ve7lm No, sadly I gave up on training Tesseract 5. I'm going to try to learn how to train Tesseract 4 because there are a lot more videos on YouTube.
@faint.2396 1 year ago
@TuanLe-ve7lm I actually fixed the issue by using Linux. But now I get other errors lol
@abdeldjalilchougui 1 year ago
@faint.2396 Did you fix your problem?
@sebastianorzechowski4613 3 months ago
I think it could be related to text2image itself. You have to provide the path to text2image.exe, which is generally located in the Tesseract installation folder.
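A quick way to test that theory is to check whether the executable can be found at all before the script calls it. A minimal sketch using the standard library's `shutil.which`, assuming the tool is named text2image (on Windows, `shutil.which` also tries the usual executable extensions); the helper `find_tool` and the example install folder are hypothetical:

```python
import shutil


def find_tool(name, extra_dirs=()):
    """Return the full path to an executable, or None if it is not found.

    Looks on PATH first, then in each directory listed in extra_dirs.
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which(name, path=str(directory))
        if found:
            return found
    return None


# The extra folder below is only an example location, not a known default.
path = find_tool("text2image", extra_dirs=[r"C:\Program Files\Tesseract-OCR"])
print(path if path else "text2image not found; add its folder to PATH")
```

If a path is returned, it can be passed as the first element of the command list given to `subprocess.run` in place of the bare tool name.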
@utkarshmishra6194 1 year ago
Hi Gabriel, hope you are doing well. I ran this command:
TESSDATA_PREFIX=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata MAX_ITERATIONS=400
But I am getting this error:
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Failed to read data from: data/langdata/Apex/Apex.config
Null char=2
lstmtraining \
  --debug_interval 0 \
  --traineddata data/Apex/Apex.traineddata \
  --old_traineddata /mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata/eng.traineddata \
  --continue_from data/eng/Apex.lstm \
  --learning_rate 0.0001 \
  --model_output data/Apex/checkpoints/Apex \
  --train_listfile data/Apex/list.train \
  --eval_listfile data/Apex/list.eval \
  --max_iterations 1000 \
  --target_error_rate 0.01
Failed to load list of training filenames from data/Apex/list.train
make: *** [Makefile:319: data/Apex/checkpoints/Apex_checkpoint] Error 1
@nithyavenugopal6834 10 months ago
Hi, were you able to solve this error? If so, how?
@athosmba1766 11 months ago
When I use the command TESSDATA_PREFIX=.../tesseract/tessdata make training model_NAME=Apex Start_MODEL=eng TESSDATA=.../tesseract/tessdata MAX_INTERATION=100 it doesn't work; it gives an error about the command TESSDATA=........
@athosmba1766 11 months ago
Can someone help me?
@Ethiopic 11 months ago
Are you getting a "not recognized" error? I am getting the same error on Windows. The exact same command works fine on the Mac. Very strange. Did you find a solution?
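The `NAME=value command` prefix used in the training command is POSIX shell syntax, which the Windows command prompt does not understand; that would be consistent with a "not recognized" error there and with the same line working on a Mac. Make variable names are also case-sensitive, so MODEL_NAME, START_MODEL and MAX_ITERATIONS need to be spelled exactly as in the video description. A minimal sketch that builds the same command in a shell-independent way, assuming make and the other tools tesstrain needs are installed and on PATH; the function name `build_training_command` is hypothetical:

```python
import os


def build_training_command(model_name, start_model, tessdata, max_iterations):
    """Return (args, env) for the training command from the video description.

    TESSDATA_PREFIX is passed through the environment instead of the
    shell-specific NAME=value prefix.
    """
    args = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    return args, env


args, env = build_training_command("Apex", "eng", "../tesseract/tessdata", 10000)
print(" ".join(args))

# To launch it from the tesstrain folder (not executed here):
# import subprocess
# subprocess.run(args, env=env, check=True)
```

Several commenters in this thread report that running everything on Linux (natively or in a virtual machine) was the simpler route, so this only removes the shell-syntax problem, not any other Windows-specific ones.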
@vishnubalaji9500 1 year ago
understood jack shit from this video needs more dumbing down
@faint.2396 1 year ago
For real, and I did every step the same and I'm getting errors. Why isn't training Tesseract 5 as simple as Tesseract 4? And the thing is, there's only ONE video on how to train Tesseract 5, and it's this one.
@sayantanbiswas9702 3 months ago
tesseract data/coc-ground-truth/eng_2.tif stdout --tessdata-dir /home/godmode2/tesseract_tutorial/tesstrain/data --psm 7 -l coc --loglevel ALL
@sayantanbiswas9702 3 months ago
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=coc START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000