Tesseract 5.3.4: A Leap Forward in Optical Text Recognition

A new point release of the Tesseract OCR engine, version 5.3.4, is now available with several improvements.

Tesseract OCR is a free, open-source application and a standard choice for most OCR use cases. The latest release, Tesseract 5.3.4, refines everyday OCR work with a handful of key improvements. Tesseract was first developed at Hewlett-Packard between 1985 and 1998, and Google began sponsoring it in 2006. Since then it has evolved into one of the most capable OCR systems available.

Tesseract offers two distinct recognition engines: the classic engine, which identifies text at the level of individual character patterns, and a newer engine based on machine learning, specifically a Long Short-Term Memory (LSTM) recurrent neural network. The LSTM engine is optimized for recognizing whole strings of text, which brings a marked gain in accuracy. Ready-made trained models for 123 languages mean most users never need to train a model themselves.
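As a minimal sketch of how the two engines are chosen in practice, the Python snippet below builds a tesseract command line with the CLI's --oem option (0 for the classic engine, 1 for the LSTM engine). The helper name build_engine_command and the file names are hypothetical, and the snippet only assembles the command rather than running it. Note that the classic engine needs trained data that still contains the legacy model, which is an assumption about your local setup.

```python
import shlex

# Engine modes accepted by the tesseract CLI's --oem option.
OEM_LEGACY = 0  # classic engine working on character patterns
OEM_LSTM = 1    # LSTM neural-network engine


def build_engine_command(image_path, output_base, oem=OEM_LSTM):
    """Return the argument list for one tesseract run with the chosen engine.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if oem not in (OEM_LEGACY, OEM_LSTM):
        raise ValueError(f"unsupported engine mode: {oem}")
    return ["tesseract", image_path, output_base, "--oem", str(oem)]


# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(build_engine_command("scan.png", "scan_out")))
```

Passing the resulting list to subprocess.run would perform the actual recognition, provided tesseract is installed and on the PATH.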

Tesseract 5.3.4: Key Highlights

UTF-8 Support and Multilingual Capabilities

Tesseract 5.3.4 is usable across a wide range of languages. It supports UTF-8 and recognizes text in more than 100 languages, from widely spoken ones such as English and Spanish to Cyrillic-script languages such as Russian, Kazakh, Belarusian, and Ukrainian. That breadth makes Tesseract a versatile choice for a global user base with diverse linguistic needs.
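For documents that mix languages, the tesseract CLI accepts several language codes joined with "+" in its -l option. The sketch below shows one way to assemble that argument in Python. The helper names are hypothetical, and it assumes the matching trained data files (for example eng, rus, and ukr) are installed.

```python
def language_argument(codes):
    """Join language codes into the 'eng+rus' form expected by -l."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def build_multilang_command(image_path, output_base, codes):
    """Return a tesseract argument list for a multi-language run.

    Hypothetical helper for illustration; it does not execute anything.
    """
    return ["tesseract", image_path, output_base, "-l", language_argument(codes)]


print(build_multilang_command("letter.png", "letter", ["eng", "rus", "ukr"]))
```

Because the recognized text is written as UTF-8, any script that reads the result back should open the output file with encoding="utf-8".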

More Output Formats

Adaptability has always been one of Tesseract's standout features, and the 5.3.4 release keeps it that way. OCR results can be saved in several formats: plain text, HTML (hOCR), ALTO (XML), PDF, and TSV. This makes it straightforward to fit Tesseract into existing workflows with different project requirements.
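On the command line, an output format is requested by naming its config (txt, hocr, alto, pdf, or tsv) after the output base, and several can be listed in one run. The following Python sketch assembles such a command and predicts the file names it should produce. The helper names and the extension table are written for illustration under that assumption, not taken from the release notes.

```python
# Config name on the command line -> extension of the file tesseract writes.
OUTPUT_EXTENSIONS = {
    "txt": ".txt",
    "hocr": ".hocr",
    "alto": ".xml",
    "pdf": ".pdf",
    "tsv": ".tsv",
}


def build_output_command(image_path, output_base, formats):
    """Return a tesseract argument list requesting each format in one run."""
    unknown = [fmt for fmt in formats if fmt not in OUTPUT_EXTENSIONS]
    if unknown:
        raise ValueError(f"unknown output format(s): {unknown}")
    return ["tesseract", image_path, output_base, *formats]


def expected_files(output_base, formats):
    """List the files the run is expected to create, one per format."""
    return [output_base + OUTPUT_EXTENSIONS[fmt] for fmt in formats]


print(build_output_command("invoice.png", "invoice", ["pdf", "tsv"]))
print(expected_files("invoice", ["pdf", "tsv"]))
```

Requesting several formats in a single invocation avoids running recognition once per format.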

Performance Optimization

Tesseract 5.3.4 goes beyond features and uses modern hardware capabilities to optimize performance. It includes modules that use OpenMP and SIMD instruction sets such as AVX, AVX2, AVX512F, NEON, or SSE4.1, which helps text recognition run quickly and efficiently.
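To see which of these optimizations a given build reports, one option is to inspect the output of tesseract --version. The sketch below assumes the banner lists each detected feature on a line starting with "Found ", which may differ between builds. The sample text is illustrative rather than captured from a real 5.3.4 installation, and the function names are hypothetical.

```python
import subprocess

# Illustrative sample only; not captured from a real 5.3.4 build.
SAMPLE_VERSION_OUTPUT = """\
tesseract 5.3.4
 Found AVX2
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
"""


def found_features(version_output):
    """Collect the text that follows each 'Found ' prefix in the banner."""
    prefix = "Found "
    features = []
    for line in version_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            features.append(stripped[len(prefix):])
    return features


def read_version_output(binary="tesseract"):
    """Run '<binary> --version' and return its text, merging stderr into stdout."""
    result = subprocess.run(
        [binary, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout


# Parse the sample; swap in read_version_output() on a machine with tesseract.
print(found_features(SAMPLE_VERSION_OUTPUT))
```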

Key Improvements in Tesseract 5.3.4:

Enhanced Image Recognition via URL: The new version improves recognition of images given by URL, which are downloaded using the libcurl library. Tesseract now sets a User-Agent header when loading the image, and a new “curl_cookiefile” parameter lets it use a cookie file.
TCP Protocol for ScrollView Server: Connections to the ScrollView server now prefer the TCP protocol.
Improved User Experience: The command “combine_tessdata -d” now writes its output to stdout instead of stderr, which makes the output easier to pipe and redirect.
Build Issue Fixes: Tesseract 5.3.4 addresses build issues when using autoconf and clang, ensuring a smoother installation process.
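The URL and cookie-file options from the list above can be sketched as a command line built in Python. The curl_cookiefile parameter is passed with the CLI's -c NAME=VALUE option, and "stdout" as the output base sends the recognized text to standard output. The helper name, the example URL, the cookie file name, and the http/https check are assumptions made for illustration, and the command only works with a Tesseract build that includes libcurl.

```python
from urllib.parse import urlparse


def build_url_command(image_url, output_base="stdout", cookie_file=None):
    """Return a tesseract argument list that reads its image from a URL.

    Hypothetical helper for illustration; it does not execute anything.
    """
    # Restricting to http/https is a choice made for this sketch.
    if urlparse(image_url).scheme not in ("http", "https"):
        raise ValueError("image_url must be an http or https URL")
    command = ["tesseract", image_url, output_base]
    if cookie_file:
        # -c sets a Tesseract parameter; here it points at the cookie file.
        command += ["-c", f"curl_cookiefile={cookie_file}"]
    return command


print(build_url_command("https://example.com/page.png", cookie_file="cookies.txt"))
```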

The 5.3.4 release reinforces Tesseract's position as a leading open-source OCR solution. This version should arrive in the Ubuntu, Debian, and other distribution repositories within a few days.

You can download this release from the official page.

Via release notes
