OCR with the Tesseract interface

Take a look at tessnet, a .NET wrapper around the Tesseract engine.


The source code seems to be geared toward building an executable, so you might need to rewire things a bit to build it as a DLL instead. I don't have much experience with Visual C++, but I think it shouldn't be too hard with some research. My guess is that someone has already made a library version, so try searching for one first.

Once you have the tesseract-ocr code in a DLL, you can import it into your C# project via Visual Studio and have it create wrapper classes and do all the marshaling for you. If the import fails (Visual Studio can only reference managed or COM DLLs this way), DllImport will let you call the functions in the DLL from C# code.

Then you can look at the source of the original executable for clues about which functions to call to OCR a TIFF image properly.
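The DllImport route might look roughly like the sketch below. It assumes a Tesseract build that exports the C API declared in capi.h (TessBaseAPICreate, TessBaseAPIInit3, TessBaseAPISetImage2, TessBaseAPIGetUTF8Text, TessDeleteText, TessBaseAPIEnd, TessBaseAPIDelete) and a Leptonica DLL that exports pixRead and pixDestroy. The library names "tesseract" and "leptonica" and the class name TesseractNative are placeholders, not names taken from any particular build.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

public static class TesseractNative
{
    // ASSUMPTION: placeholder library names; change them to match the DLLs you built or installed.
    private const string TessLib = "tesseract";
    private const string LeptLib = "leptonica";

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr TessBaseAPICreate();

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern int TessBaseAPIInit3(IntPtr handle, string datapath, string language);

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern void TessBaseAPISetImage2(IntPtr handle, IntPtr pix);

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr TessBaseAPIGetUTF8Text(IntPtr handle);

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern void TessDeleteText(IntPtr text);

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern void TessBaseAPIEnd(IntPtr handle);

    [DllImport(TessLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern void TessBaseAPIDelete(IntPtr handle);

    [DllImport(LeptLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr pixRead(string filename);

    [DllImport(LeptLib, CallingConvention = CallingConvention.Cdecl)]
    private static extern void pixDestroy(ref IntPtr pix);

    // Copies a NUL-terminated UTF-8 string out of unmanaged memory.
    public static string ReadUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return string.Empty;
        var bytes = new List<byte>();
        for (int i = 0; ; i++)
        {
            byte b = Marshal.ReadByte(ptr, i);
            if (b == 0) break;
            bytes.Add(b);
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    // Runs OCR on one image file and returns the recognized text.
    public static string Recognize(string imagePath, string tessdataDir, string language)
    {
        IntPtr api = TessBaseAPICreate();
        IntPtr pix = IntPtr.Zero;
        try
        {
            if (TessBaseAPIInit3(api, tessdataDir, language) != 0)
                throw new InvalidOperationException("Tesseract failed to initialize.");
            pix = pixRead(imagePath);
            if (pix == IntPtr.Zero)
                throw new InvalidOperationException("Could not read image: " + imagePath);
            TessBaseAPISetImage2(api, pix);
            IntPtr text = TessBaseAPIGetUTF8Text(api);
            try { return ReadUtf8(text); }
            finally { if (text != IntPtr.Zero) TessDeleteText(text); }
        }
        finally
        {
            // Release native resources even if recognition throws.
            if (pix != IntPtr.Zero) pixDestroy(ref pix);
            TessBaseAPIEnd(api);
            TessBaseAPIDelete(api);
        }
    }
}
```

A call would then look like TesseractNative.Recognize("scan.tif", "tessdata", "eng"), where both paths are illustrative. The native DLLs have to match the bitness of the C# process, otherwise the first call fails at load time.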


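A C# program can launch tesseract.exe and then read the output file it writes. Tesseract expects the input image followed by the output base name, and appends .txt to the latter (the input file name below is a placeholder):

// Requires: using System.Diagnostics; using System.IO;
Process process = Process.Start("tesseract.exe", "input.tif out");
process.WaitForExit();
if (process.ExitCode == 0)
{
    // Tesseract wrote the recognized text to out.txt.
    string content = File.ReadAllText("out.txt");
}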
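A variation on the same idea, as a hedged sketch: Tesseract builds that accept stdout as the output base let you capture the text directly, without an intermediate file. The class name TesseractCli and its methods are made up for this example, and it assumes tesseract.exe is on the PATH and that the image path contains no double quotes.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class TesseractCli
{
    // Builds: "<image>" stdout -l <language>
    // ASSUMPTION: imagePath contains no double-quote characters.
    public static string BuildArguments(string imagePath, string language)
    {
        return "\"" + imagePath + "\" stdout -l " + language;
    }

    // Runs tesseract.exe and returns whatever it prints to standard output.
    public static string Recognize(string imagePath, string language = "eng")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract.exe",
            Arguments = BuildArguments(imagePath, language),
            UseShellExecute = false,          // required for stream redirection
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full pipe buffer cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

Quoting the image path keeps file names with spaces intact. If your Tesseract version does not support stdout as an output base, fall back to the output-file approach above.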

I discovered today that Emgu CV now includes a Tesseract wrapper. While the number of unmanaged DLLs in the OpenCV library might seem a little daunting, it's nothing that a quick copy to your output directory won't cure. From there the actual OCR process is as simple as three lines:

// clip is the image to recognize; optOCR is the control that displays the result.
Tesseract ocr = new Tesseract(Path.Combine(Environment.CurrentDirectory, "tessdata"), "eng", Tesseract.OcrEngineMode.OEM_TESSERACT_ONLY);
ocr.Recognize(clip);
optOCR.Text = ocr.GetText();
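Since the constructor is pointed at a tessdata folder and the "eng" language, it can help to confirm the language data is actually there before constructing the engine. The sketch below assumes Tesseract 3 or later, where language data ships as a single file named after the language code with a .traineddata extension. TessdataCheck and HasLanguageData are hypothetical helper names, not part of Emgu CV.

```csharp
using System.IO;

public static class TessdataCheck
{
    // Returns true if <tessdataDir>/<language>.traineddata exists.
    // ASSUMPTION: Tesseract 3+ file naming; older releases used several files per language.
    public static bool HasLanguageData(string tessdataDir, string language)
    {
        return File.Exists(Path.Combine(tessdataDir, language + ".traineddata"));
    }
}
```

Calling TessdataCheck.HasLanguageData(Path.Combine(Environment.CurrentDirectory, "tessdata"), "eng") before the constructor lets you report a clear error when the data file is missing, instead of relying on whatever failure the native engine produces.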

"robomatics" put together a very nice YouTube video that demonstrates a simple but effective solution.


Disclaimer: I work for Atalasoft

Our OCR module supports Tesseract, and if that proves not to be good enough, you can upgrade to a better engine by changing one line of code (we provide a common interface to multiple OCR engines).

