Enhancing Search Capabilities for Non-English Datasets in the Dataset Viewer

Community Article Published July 10, 2024

We recently added full-text search support for non-English monolingual datasets to the Dataset Viewer. Search is no longer tuned for English text only, which makes it useful for a much wider range of datasets on the platform.

Understanding Full-Text Search

Full-text search (FTS) differs from exact-match lookups: it searches the full text of every document and ranks the results by relevance to the query, rather than requiring an exact word or phrase match. As a result, a query can surface documents that use related forms of the search terms.
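To make "ranked by relevance" concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring, the ranking function that DuckDB's FTS extension exposes as match_bm25. The function name bm25_scores, the whitespace tokenizer, and the sample sentences are illustrative assumptions; this shows a common BM25 variant and is not DuckDB's implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue
            idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
            freq = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase the cat and the other cat",
    "birds sing",
]
scores = bm25_scores("cat", docs)
print(scores)
```

The document that mentions "cat" twice receives the highest score, the one that mentions it once scores lower, and the document without the term scores zero.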

At Hugging Face, we use the DuckDB FTS extension to index all text columns of a dataset. The indexing process produces an index file, which is stored in the dataset repository on the "refs/convert/duckdb" branch.

Implementing the FTS Extension

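Implementing full-text search with DuckDB is straightforward. Once the data is loaded into a table, the index is built with the 'create_fts_index' function, invoked as a PRAGMA statement, which has the following signature: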

create_fts_index(input_table, input_id, *input_values, stemmer = 'porter',
                 stopwords = 'english', ignore = '(\\.|[^a-z])+',
                 strip_accents = 1, lower = 1, overwrite = 0)

The 'stemmer' parameter is what makes FTS work across different languages. As the signature shows, it defaults to 'porter', which assumes English text. Stemming, a common text-preprocessing technique in natural language processing (NLP), reduces words to a shared root form, called a "stem". A stem is not necessarily a dictionary word, which distinguishes it from the "lemma" produced by lemmatization.
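As a rough sketch of how the signature above can be used from Python, the helpers below build the PRAGMA statement and a ranked search query for a given stemmer. The helper names (build_fts_pragma, build_search_query, run_demo), the table name "phrases", and the sample sentences are hypothetical; the example assumes the third-party duckdb package is installed and that the fts extension can be downloaded on first use.

```python
def build_fts_pragma(table, id_column, text_columns, stemmer="porter"):
    """Return the PRAGMA statement that builds an FTS index on `table`."""
    columns = ", ".join(f"'{column}'" for column in text_columns)
    return (
        f"PRAGMA create_fts_index('{table}', '{id_column}', {columns}, "
        f"stemmer = '{stemmer}', overwrite = 1)"
    )


def build_search_query(table, id_column):
    """Return a query that ranks rows by BM25 score for one bound parameter."""
    return (
        "SELECT * FROM ("
        f"SELECT *, fts_main_{table}.match_bm25({id_column}, ?) AS score "
        f"FROM {table}"
        ") AS ranked WHERE score IS NOT NULL ORDER BY score DESC"
    )


def run_demo():
    """Index a few Spanish phrases and search them (requires duckdb)."""
    import duckdb  # third-party dependency, imported lazily on purpose

    con = duckdb.connect()
    con.execute("INSTALL fts")  # may need network access the first time
    con.execute("LOAD fts")
    con.execute("CREATE TABLE phrases (id INTEGER, text VARCHAR)")
    con.executemany(
        "INSERT INTO phrases VALUES (?, ?)",
        [
            (1, "Ella jugó en el parque"),
            (2, "El juego terminó tarde"),
            (3, "Hemos jugado toda la tarde"),
            (4, "Me gusta jugar al fútbol"),
        ],
    )
    con.execute(build_fts_pragma("phrases", "id", ["text"], stemmer="spanish"))
    return con.execute(build_search_query("phrases", "id"), ["jugar"]).fetchall()
```

Calling run_demo() once with stemmer="spanish" and once with the default 'porter' is a simple way to compare how the choice of stemmer changes which rows are returned.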

Impact of Stemming on Search

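The stemmer used by DuckDB improves search recall by mapping inflected variants of a word to the same stem. For instance, "running" and "runs" both reduce to "run", so a query for any one of these forms matches the others. Irregular forms such as "ran" are generally not covered, because stemmers apply rule-based suffix stripping rather than a dictionary lookup.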
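The effect can be illustrated with a deliberately simplified suffix-stripping stemmer. This is a toy sketch for English only, with a hypothetical function name (toy_stem) and a hand-picked suffix list; it is not the algorithm DuckDB uses, and real stemmers apply many more rules.

```python
SUFFIXES = ("ing", "ed", "s")  # tiny, hand-picked list (assumption)
VOWELS = "aeiou"


def toy_stem(word):
    """Strip one common English suffix and undouble a final consonant."""
    stem = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    # "runn" -> "run": collapse a doubled final consonant
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        stem = stem[:-1]
    return stem


for word in ("running", "runs", "run", "ran"):
    print(word, "->", toy_stem(word))
```

Here "running", "runs" and "run" collapse to the same stem, so an index built on stems treats them as the same term, while the irregular form "ran" is left unchanged.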

In the Dataset Viewer, using the stemmer that matches the dataset's language improves the relevance of search results, so users can find information in non-English datasets more reliably. To enable this feature, you need to:

  1. Assign the language to your dataset in the dataset card
  2. Perform searches!

Let's analyze the following Spanish dataset: https://huggingface.co/datasets/asoria/es-text

image/png

As you can see, the dataset has some phrases containing the words "jugó", "juego" and "jugado", which are all related to the verb "jugar" (to play). Now, let's run a search and try to list all the phrases that match the word "jugar":

image/png

As you can see, the results include only phrases that contain the exact word "jugar". This is because the dataset does not have a language assigned yet:

image/png

Setting the dataset language

To enable full-text search for non-English datasets in the Dataset Viewer, follow these steps:

  1. Go to the "Files and versions" tab

image/png

  2. Open the README.md file

image/png

  3. Choose "Spanish" in the languages field

image/png

  4. Commit your change

image/png

Tip: For more details on configuring the language, see the Dataset Card documentation.

The language tag will look like this:

image/png
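In text form, the tag is a "language" entry in the YAML metadata block at the top of README.md. The short sketch below generates that block for illustration; the helper name language_front_matter is hypothetical, and "es" is the ISO 639-1 code for Spanish.

```python
def language_front_matter(language_codes):
    """Build the YAML metadata block that tags a dataset card with languages."""
    lines = ["---", "language:"]
    lines += [f"- {code}" for code in language_codes]
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = language_front_matter(["es"])
print(front_matter)
# ---
# language:
# - es
# ---
```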

After refreshing the Dataset Viewer:

image/png

The dataset is now correctly configured for language-aware search, and the same query returns the results we want.

Considerations

  • We currently support the following stemmers, each named after its language except 'porter', which is the original Porter stemmer for English: 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil' and 'turkish'.
  • The feature is intended for monolingual datasets only (we are still investigating how to make cool things over multilingual datasets!).
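One way to picture these two considerations together is a small lookup that picks a stemmer only when the dataset card declares exactly one supported language. This is a hypothetical sketch: the function name pick_stemmer, the partial code-to-stemmer mapping, and the fallback to 'porter' are assumptions for illustration, not a description of how the Dataset Viewer resolves the stemmer internally.

```python
# Partial, illustrative mapping from ISO 639-1 codes to stemmer names (assumption)
STEMMER_BY_CODE = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "pt": "portuguese",
}


def pick_stemmer(languages, default="porter"):
    """Return a language-specific stemmer for monolingual datasets only."""
    if len(languages) != 1:
        # Untagged or multilingual dataset: fall back to the default
        return default
    return STEMMER_BY_CODE.get(languages[0], default)


print(pick_stemmer(["es"]))        # spanish
print(pick_stemmer(["es", "en"]))  # porter
print(pick_stemmer([]))            # porter
```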