Enhancing Search Capabilities for Non-English Datasets in the Dataset Viewer
Recently, we introduced an enhancement to the Dataset Viewer that enables full-text search functionality across non-English monolingual datasets. This feature significantly expands the scope of search capabilities within our platform.
Understanding Full-Text Search
Full-text search (FTS) differs from exact-match search: rather than returning only the rows that contain an exact word or phrase, it ranks entire documents by their relevance to the query terms. This approach returns more comprehensive results for a given query.
At Hugging Face, we use the DuckDB FTS Extension to index all text columns within datasets. The indexing process creates an index file, which is stored in the dataset repository on the "refs/convert/duckdb" branch.
Implementing the FTS Extension
Implementing full-text search with DuckDB is straightforward. Once the data is loaded into a table, the index is built by calling the following function through a PRAGMA statement:
create_fts_index(input_table, input_id, *input_values, stemmer = 'porter',
stopwords = 'english', ignore = '(\\.|[^a-z])+',
strip_accents = 1, lower = 1, overwrite = 0)
The 'stemmer' parameter in this command is what enables FTS across different languages. As the signature above shows, it defaults to 'porter', which assumes English text. Stemming, a common text-preprocessing technique in natural language processing (NLP), reduces words to a root form known as a "stem". Unlike a lemma, a stem does not have to be a dictionary word.
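To make the signature concrete, here is a minimal sketch of building and querying an index. The table "docs", its columns, and the sample rows are hypothetical and stand in for an ingested dataset; this is not the exact statement the Dataset Viewer runs. The query is expected to surface the rows containing "runs" and "running", since the Porter stemmer reduces both to "run".

```sql
-- The FTS extension has to be installed and loaded once per database.
INSTALL fts;
LOAD fts;

-- Hypothetical table standing in for an ingested dataset.
CREATE TABLE docs (id INTEGER, title VARCHAR, body VARCHAR);
INSERT INTO docs VALUES
    (1, 'Morning routine', 'She runs every morning before work.'),
    (2, 'Race report', 'Running a marathon takes months of training.'),
    (3, 'Recipe', 'Slice the bread and toast it lightly.'),
    (4, 'Weather', 'Heavy rain is forecast for the weekend.'),
    (5, 'Gardening', 'Tomatoes need plenty of sunlight.');

-- '*' indexes every VARCHAR column of the table.
-- The stemmer is spelled out here even though 'porter' is the default.
PRAGMA create_fts_index('docs', 'id', '*', stemmer = 'porter');

-- match_bm25 returns NULL for rows that do not match the query,
-- so filtering on the score keeps only the hits.
SELECT id, title, score
FROM (
    SELECT *, fts_main_docs.match_bm25(id, 'run') AS score
    FROM docs
) AS ranked
WHERE score IS NOT NULL
ORDER BY score DESC;
```

The index lives in a schema named after the table ("fts_main_docs" here), which is why the search function is called with that prefix.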
Impact of Stemming on Search
DuckDB's stemming algorithm improves search recall by matching variations of a word to their common stem. For instance, "running" and "runs" both reduce to "run", so a query for one also matches the other. Irregular forms such as "ran" are generally not covered, because stemmers work by stripping suffixes.
In the Dataset Viewer, using the stemmer that matches the dataset's language improves the relevance of search results. Users can now explore and retrieve relevant information from non-English datasets more efficiently. To enable this feature, you need to:
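One way to see what a stemmer does to individual words is the "stem" scalar function that ships with the FTS extension. The sketch below is only an inspection aid with arbitrarily chosen sample words; outputs are deliberately not listed, since they depend on the stemmer's rules.

```sql
INSTALL fts;
LOAD fts;

-- English words run through the default 'porter' stemmer.
SELECT word, stem(word, 'porter') AS porter_stem
FROM (VALUES ('running'), ('runs'), ('ran')) AS t(word);

-- Spanish verb forms run through the 'spanish' stemmer.
SELECT word, stem(word, 'spanish') AS spanish_stem
FROM (VALUES ('jugar'), ('jugado'), ('jugando')) AS t(word);
```

Two words match each other in a search only when they end up with the same stem, so comparing the second column across rows shows which forms a query will reach.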
- Assign the language to your dataset in the dataset card
- Perform searches!
Let's analyze the following Spanish dataset: https://huggingface.co/datasets/asoria/es-text
As you can see, some phrases contain the words "jugó", "juego" and "jugado", which are all related to the verb "jugar". Now, let's perform a search that tries to list all the phrases matching the word "jugar":
As you can see, the results include only the phrases that exactly match the word "jugar". This is because the dataset does not have a language assigned yet.
Setting the dataset language
To enable full-text search for non-English datasets in the Dataset Viewer, follow these steps:
- Go to the "Files and versions" tab
- Open the README.md file
- Choose "Spanish" in the languages field
- Commit your change
Tip: The Dataset Card documentation describes how to configure the language.
The language tag is saved in the metadata block at the top of README.md as a "language" entry. For Spanish, its value is "es".
After refreshing the Dataset Viewer, the same search returns the results we want: the dataset is now correctly configured to perform searches based on its language.
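The effect of the language setting can be approximated locally by building the same index twice with different stemmers. The sketch below makes several assumptions: the table "frases" and its rows are invented for illustration (they are not taken from the asoria/es-text dataset), and the other parameters are left at their defaults, which may differ from what the Dataset Viewer uses.

```sql
INSTALL fts;
LOAD fts;

-- Invented Spanish sample rows, for illustration only.
CREATE TABLE frases (id INTEGER, texto VARCHAR);
INSERT INTO frases VALUES
    (1, 'Ella jugó al tenis ayer.'),
    (2, 'El juego empieza a las cinco.'),
    (3, 'Hemos jugado toda la tarde.'),
    (4, 'Me gusta jugar al ajedrez.'),
    (5, 'El gato duerme en el sofá.'),
    (6, 'Mañana lloverá en Madrid.'),
    (7, 'Compré pan y queso.'),
    (8, 'La biblioteca cierra temprano.'),
    (9, 'El tren llega con retraso.');

-- First pass: default English stemmer, as when no language is assigned.
PRAGMA create_fts_index('frases', 'id', 'texto', stemmer = 'porter');

SELECT id, texto
FROM (
    SELECT *, fts_main_frases.match_bm25(id, 'jugar') AS score
    FROM frases
) AS ranked
WHERE score IS NOT NULL
ORDER BY score DESC;

-- Second pass: rebuild the index with the Spanish stemmer.
-- overwrite = 1 replaces the index created above.
PRAGMA create_fts_index('frases', 'id', 'texto',
                        stemmer = 'spanish', overwrite = 1);

SELECT id, texto
FROM (
    SELECT *, fts_main_frases.match_bm25(id, 'jugar') AS score
    FROM frases
) AS ranked
WHERE score IS NOT NULL
ORDER BY score DESC;
```

Running both queries side by side shows which conjugated forms each stemmer connects to "jugar". Only the stemmer changes between the two passes, so any difference in the result sets comes from stemming alone.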
Considerations
- We currently support the following stemmers: 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil' and 'turkish'. All of them are named after a language except 'porter', the default stemmer for English text.
- The feature is intended for monolingual datasets only (we are still investigating how to do cool things with multilingual datasets!)