You can also train a new font with Tesseract in Google Colab. Link to the Google Colab notebook: colab.research.google.com/git... #aniquemaniac #tesseract #googlecolab
Bro, your final output text file is formatted line by line, matching the source image file. Looks good. My output text file comes out as a single large paragraph without any line formatting. Why is that?
I'm not sure whether this is really working. I hope you can first show the detection result of the default pytesseract model and compare it with the result of pytesseract using the trained data file, so that we can see whether the training is effective or not.
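For anyone who wants to run that comparison themselves, here is a minimal sketch. It assumes pytesseract and Pillow are installed and that you have a hand-typed ground-truth transcription of the test image. The function names, `custom_lang`, and `tessdata_dir` are placeholders of mine, not something from the video or the notebook.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, ground_truth: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and the expected text."""
    # Collapse whitespace so line-break differences do not dominate the score.
    a = " ".join(ocr_text.split())
    b = " ".join(ground_truth.split())
    return SequenceMatcher(None, a, b).ratio()


def compare_models(image_path, ground_truth, custom_lang, tessdata_dir):
    """Score the stock 'eng' model and a custom model on the same image.

    custom_lang and tessdata_dir are hypothetical: use the name of your own
    .traineddata file (without the extension) and the folder that contains it.
    """
    # Imported lazily so the scoring helper works without pytesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    default_text = pytesseract.image_to_string(image, lang="eng")
    custom_text = pytesseract.image_to_string(
        image, lang=custom_lang, config=f'--tessdata-dir "{tessdata_dir}"'
    )
    return {
        "default": similarity(default_text, ground_truth),
        "custom": similarity(custom_text, ground_truth),
    }
```

If the "custom" score is not clearly higher than the "default" score on images the model was not trained on, the training probably did not help much. A similarity ratio is only a rough measure; a proper character error rate would be stricter.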
How can I train the Arabic language? I know there is an ara.traineddata, but I need to add new characters. My question is how to prepare my data and the font file just for creating the training data. If there is a link to any discussions on this, I would be grateful.
That's what I have been looking for, dear. But since I'm a newbie with Tesseract, I'm a bit curious about the training data you used: is it a PDF image or a text-line image? And after training, can this model be used in a web application? Looking forward to hearing back from you, thanks.
I used a font file (typouprighBT.ttf) to generate the training data. If you only have an image or PDF file, you can identify its font at www.myfonts.com/whatthefont, then search for the font, download it, and generate the training data. As for a web application, I have never tried it; there may be a way to use the trained model with tesseract.js.
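As a rough illustration of the "font file to training data" step, the sketch below builds a command line for Tesseract's text2image tool, which renders a text file in a given font. This is not necessarily what the Colab notebook does (tesstrain.sh calls text2image internally); the font name, folders, and file names are all placeholders.

```python
import shlex


def text2image_command(font_name, fonts_dir, text_file, output_base):
    """Build a text2image command that renders training text in one font."""
    return [
        "text2image",
        f"--font={font_name}",        # font name as reported by the font, not the .ttf file name
        f"--fonts_dir={fonts_dir}",   # folder that holds the downloaded .ttf file
        f"--text={text_file}",        # plain-text file with the training text
        f"--outputbase={output_base}",  # prefix for the generated image and box files
    ]


# Hypothetical values, for illustration only.
cmd = text2image_command("My Font", "/content/fonts", "training_text.txt", "out/eng.MyFont.exp0")
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where the Tesseract training tools are installed.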
@aniquemaniac Oh, sorry to get back to you this late, but thanks again, dear. I followed your steps and it's working. Since you used the .ttf font directly, is it okay to increase max_page based on our preferences? Is it going to overfit? Looking forward to hearing back from you.
@aniquemaniac Okay, could you please tell me how to tune the LSTM model? For example, I need to use a different activation function and so on. I also need to retrain for the Hindi language with Tesseract 4. Is that possible with your Colab code?
Running "tesstrain.sh" throws an error every time. Does anybody know the reason?
== Constructing LSTM training data ===
[Wed Jun 15 19:18:35 UTC 2022] /usr/bin/combine_lang_model --input_unicharset /tmp/eng-2022-06-15.xPx/eng.unicharset --script_dir /content/drive/MyDrive/langdata_lstm --words /content/drive/MyDrive/langdata_lstm/eng/eng.wordlist --numbers /content/drive/MyDrive/langdata_lstm/eng/eng.numbers --puncs /content/drive/MyDrive/langdata_lstm/eng/eng.punc --output_dir train --lang eng
Loaded unicharset of size 69 from file /tmp/eng-2022-06-15.xPx/eng.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Failed to read data from: /content/drive/MyDrive/langdata_lstm/eng/eng.config
Null char=2
Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
ERROR: Program combine_lang_model failed. Abort.