Text Recognition with ML Kit for Android: Getting Started

ML Kit is a mobile SDK from Google that uses machine learning to solve problems such as text recognition, text translation, object detection, face/pose detection, and so much more!
The APIs can run on-device, enabling you to process real-time use cases without sending data to servers.
ML Kit provides two groups of APIs:
- Vision APIs: These include barcode scanning, face detection, text recognition, object detection, and pose detection.
- Natural Language APIs: You use these whenever you need to identify languages, translate text, or generate smart replies in text conversations.
This tutorial will focus on Text Recognition. With this API you can extract text from images, documents, and camera input in real time.
In this tutorial, you’ll learn:
- What a text recognizer is and how it groups text elements.
- The ML Kit Text Recognition features.
- How to recognize and extract text from an image.
Getting Started
Throughout this tutorial, you’ll work with Xtractor. This app lets you take a picture and extract the X usernames in it. You could use it at a conference whenever a speaker shows their contact details and you’d like to look them up later.
Use the Download Materials button at the top or bottom of this tutorial to download the starter project.
Once downloaded, open the starter project in Android Studio Meerkat or newer. Build and run, and you’ll see the following screen:
Clicking the plus button lets you choose a picture from your gallery, but there’s no text recognition yet.
Before adding text recognition functionality, you need to understand some concepts.
Using a Text Recognizer
A text recognizer can detect and interpret text from various sources, such as images, videos, or scanned documents. This process is called OCR, which stands for Optical Character Recognition.
Some text recognition use cases might be:
- Scanning receipts or books into digital text.
- Translating signs from static images or the camera.
- Automatic license plate recognition.
- Digitizing handwritten forms.
Here’s a breakdown of what a text recognizer typically does:
- Detection: Finds where the text is located within an image, video, or document.
- Recognition: Converts the detected characters or handwriting into machine-readable text.
- Output: Returns the recognized text.
ML Kit Text Recognizer segments text into blocks, lines, elements, and symbols.
Here’s a brief explanation of each one:
- Block: Shown in red, a set of text lines, such as a paragraph or column.
- Line: Shown in blue, a set of words on the same axis.
- Element: Shown in green, a contiguous set of alphanumeric characters, i.e., a word.
- Symbol: A single alphanumeric character.
ML Kit Text Recognition Features
The API has the following features:
- Recognizes text in various scripts, including Chinese, Devanagari, Japanese, Korean, and Latin. These were added in the latest (V2) version; check the supported languages here.
- Differentiates between a character, a word, a set of words, and a paragraph.
- Identifies the language of the recognized text.
- Returns bounding boxes, corner points, rotation information, and confidence scores for all detected blocks, lines, elements, and symbols.
- Recognizes text in real time.
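To make the last few features concrete, here’s a small sketch of reading that per-line metadata from a recognition result. It assumes you already have a Text result (called visionText here) from the recognizer; the accessor names follow the current V2 Kotlin API, and confidence and angle are only reported by the V2 models, so double-check them against the reference docs for your version:
import android.util.Log
import com.google.mlkit.vision.text.Text

// Logs the metadata ML Kit reports for each recognized line.
// `visionText` is the Text result returned by the recognizer.
fun logLineMetadata(visionText: Text) {
  for (block in visionText.textBlocks) {
    for (line in block.lines) {
      val language = line.recognizedLanguage // BCP-47 language tag, when detected
      val box = line.boundingBox             // android.graphics.Rect, may be null
      val corners = line.cornerPoints        // corner points, may be null
      val confidence = line.confidence       // confidence score (V2 models)
      val angle = line.angle                 // rotation angle in degrees (V2 models)
      Log.d("Xtractor", "'${line.text}' lang=$language conf=$confidence angle=$angle box=$box corners=${corners?.size}")
    }
  }
}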
Bundled vs. Unbundled
By default, all ML Kit features use Google-trained machine learning models. For text recognition specifically, the models can be installed in one of two ways:
- Unbundled: Models are downloaded and managed via Google Play Services.
- Bundled: Models are statically linked to your app at build time.
With bundled models, the models ship inside the app, so they’re available as soon as the user installs it. When the user uninstalls the app, the models are deleted along with it. To update the models, you have to ship a new version of the app, and the user has to install that update.
With unbundled models, the models live in Google Play services, and the app has to download them before first use. Uninstalling the app doesn’t necessarily delete the models; they’re only removed once every app that depends on them is uninstalled. Whenever Google releases a new version of the models, Google Play services updates them automatically for your app.
Depending on your use case, you may choose one option or the other.
Use the unbundled option if you want a smaller app size and automatic model updates from Google Play services.
Use the bundled option if you want your users to have full functionality the moment they install the app.
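As a rough illustration of how this choice shows up in your Gradle file, the two Latin-script artifacts look like this. The unbundled artifact name is the one Google documents for Play-services-managed models; the version numbers are examples you should verify against the current release notes:
dependencies {
  // Bundled: the Latin model ships inside your APK and works right after install.
  implementation("com.google.mlkit:text-recognition:16.0.1")

  // Unbundled alternative: the model is downloaded and managed by Google Play services.
  // implementation("com.google.android.gms:play-services-mlkit-text-recognition:19.0.1")
}
If you go the unbundled route, Google’s docs also describe an optional manifest meta-data entry (com.google.mlkit.vision.DEPENDENCIES with the value ocr) that asks Play services to download the model at install time.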
Adding Text Recognition Capabilities
To use the ML Kit Text Recognizer, open the app-level build.gradle file of the starter project and add the following dependencies:
implementation("com.google.mlkit:text-recognition:16.0.1")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-play-services:1.10.2")
Here, you’re using the bundled version of text-recognition. Now, sync your project.
To get the latest version of text-recognition, please check here. To get the latest version of kotlinx-coroutines-play-services, check here. And to support other languages, use the corresponding dependency; you can check them here.
Now, replace the code of recognizeUsernames with the following:
// Wrap the bitmap so ML Kit can process it; 0 is the rotation in degrees
val image = InputImage.fromBitmap(bitmap, 0)
// Create a recognizer with the default (Latin) options
val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
// Run recognition and suspend until the underlying Task completes
val result = recognizer.process(image).await()
// Placeholder: you'll replace this with the real username mapping shortly
return emptyList()
You first create an InputImage from the bitmap. Then, you get an instance of a TextRecognizer using the default options, which support Latin script. Finally, you process the image with the recognizer and await the result.
You’ll need to import the following:
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions
import kotlinx.coroutines.tasks.await
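Before moving on, note that process(image).await() can fail, and that TextRecognizer implements Closeable. The helper below is a hypothetical, more defensive variant of the same call, not part of the starter project; recognizeText and the logging tag are made-up names:
import android.graphics.Bitmap
import android.util.Log
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.Text
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions
import kotlinx.coroutines.tasks.await

// Runs recognition, releases the recognizer via use(), and returns null on failure.
suspend fun recognizeText(bitmap: Bitmap): Text? =
  TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS).use { recognizer ->
    try {
      recognizer.process(InputImage.fromBitmap(bitmap, 0)).await()
    } catch (e: Exception) {
      Log.e("Xtractor", "Text recognition failed", e)
      null
    }
  }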
You could obtain blocks, lines, and elements like this:
// 1
val text = result.text
for (block in result.textBlocks) {
  // 2
  val blockText = block.text
  val blockCornerPoints = block.cornerPoints
  val blockFrame = block.boundingBox
  for (line in block.lines) {
    // 3
    val lineText = line.text
    val lineCornerPoints = line.cornerPoints
    val lineFrame = line.boundingBox
    for (element in line.elements) {
      // 4
      val elementText = element.text
      val elementCornerPoints = element.cornerPoints
      val elementFrame = element.boundingBox
    }
  }
}
Here’s a brief explanation of the code above:
- First, you get the full text.
- Then, for each block, you get the text, the corner points, and the frame.
- For each line in a block, you get the text, the corner points, and the frame.
- Finally, for each element in a line, you get the text, the corner points, and the frame.
However, you only need the elements that represent X usernames, so replace the return emptyList() placeholder with the following code:
return result.textBlocks
  .flatMap { it.lines }
  .flatMap { it.elements }
  .filter { element -> element.text.isXUsername() }
  .mapNotNull { element ->
    element.boundingBox?.let { boundingBox ->
      UsernameBox(element.text, boundingBox)
    }
  }
Here, you flatten the text blocks into lines and the lines into elements, keep only the elements whose text is an X username, and finally map each one to a UsernameBox, a class that holds the username and its bounding box.
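Both isXUsername() and UsernameBox already exist in the starter project, so you don’t need to write them. If you’re curious what they might look like, here’s an illustrative sketch; the exact regex and class shape in the starter may differ:
import android.graphics.Rect

// Holds a recognized username together with the rectangle it occupies in the image.
data class UsernameBox(val username: String, val boundingBox: Rect)

// Treats "@" followed by 4-15 letters, digits, or underscores as an X username.
private val X_USERNAME_REGEX = Regex("^@\\w{4,15}$")

fun String.isXUsername(): Boolean = X_USERNAME_REGEX.matches(this)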
The bounding box is used to draw rectangles over the username.
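The starter project already contains the overlay that does this drawing, but as a sketch of the idea, a Compose Canvas can stroke each bounding box over the photo. This version assumes the image is displayed at its original pixel size; a real overlay would also scale the boxes to the on-screen image size:
import androidx.compose.foundation.Canvas
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.geometry.Offset
import androidx.compose.ui.geometry.Size
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.graphics.drawscope.Stroke

// Draws a red outline around every recognized username.
@Composable
fun UsernameOverlay(usernames: List<UsernameBox>, modifier: Modifier = Modifier) {
  Canvas(modifier = modifier) {
    usernames.forEach { usernameBox ->
      val box = usernameBox.boundingBox
      drawRect(
        color = Color.Red,
        topLeft = Offset(box.left.toFloat(), box.top.toFloat()),
        size = Size(box.width().toFloat(), box.height().toFloat()),
        style = Stroke(width = 4f)
      )
    }
  }
}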
Now, run the app again, choose a picture from your gallery, and you’ll get the X usernames recognized:
Congratulations! You’ve just learned how to use Text Recognition.