Speech Recognition Community Event Version 2

non-profit

AI & ML interests

Multi-Lingual Speech Recognition

Recent Activity

anton-l posted an update 16 days ago
Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
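
As a rough illustration of that filtering step (not the actual FineMath pipeline: the classifier, dataset names, label, and threshold below are all placeholders), scoring candidate pages with a text classifier and keeping the confident ones could look like this:

```python
from datasets import load_dataset
from transformers import pipeline

# Hypothetical artifacts: neither this classifier nor this candidate
# dataset are the real FineMath ones.
scorer = pipeline("text-classification", model="my-org/math-quality-classifier")

def is_math_page(example):
    # Score the start of the page; keep it only if the classifier is
    # confident it contains math reasoning.
    result = scorer(example["text"][:2048], truncation=True)[0]
    return result["label"] == "math" and result["score"] >= 0.9

pages = load_dataset("my-org/common-crawl-candidates", split="train")
math_pages = pages.filter(is_math_page)
print(f"kept {len(math_pages)} of {len(pages)} pages")
```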

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
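
If you want to poke at the data, here is a minimal sketch of streaming it from the Hub (the "finemath-4plus" config name is an assumption; check the dataset card for the available subsets):

```python
from datasets import load_dataset

# Stream the dataset so the 50B+ tokens aren't downloaded up front.
ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                  split="train", streaming=True)

for sample in ds.take(3):
    print(sample["text"][:200], "\n---")
```
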
mrm8488 posted an update 6 months ago
🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showed that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the instruction-tuned version.
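
For intuition, here is a minimal sketch of that MAGPIE-style trick: hand the instruct model only the opening of a user turn from its chat template, and it samples a plausible instruction to fill the turn. The template string is Llama 3's published format; the generation settings are illustrative, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Pre-query template: everything up to the start of the user's message.
# Given only this, the instruct model generates the instruction itself.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt",
                   add_special_tokens=False).to(model.device)

out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)
print(instruction)  # a synthetic user instruction, ready to be answered
```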

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by Sebastian Raschka's) that lets you automatically generate similar datasets using ollama models (initially phi and llama3) and upload them to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
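
The gist above is the full script; for a flavor of the core loop, here is a stripped-down sketch against ollama's local /api/generate endpoint (the prompts are illustrative, not the ones the script uses):

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str) -> str:
    # Non-streaming request to a locally running ollama server.
    resp = requests.post(OLLAMA_URL, json={
        "model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]

# Ask the model to invent a Spanish instruction, then answer it.
instruction = generate(
    "llama3",
    "Escribe una única instrucción, en español, que un usuario podría "
    "darle a un asistente de IA. Devuelve solo la instrucción.")
answer = generate("llama3", instruction)

print(json.dumps({"instruction": instruction, "output": answer},
                 ensure_ascii=False, indent=2))
```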


🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.
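
For instance, an extra post-hoc filter could look like the sketch below; the "instruction"/"output" column names, thresholds, and refusal patterns are assumptions to adapt, not properties of the released datasets.

```python
from datasets import load_dataset

ds = load_dataset(
    "mrm8488/dataset_llama3_5000_samples_es_4231_filtered", split="train")

def keep(example):
    # Illustrative heuristics only; column names are assumed and
    # thresholds should be tuned for your use case.
    instruction, output = example["instruction"], example["output"]
    if len(instruction) < 10 or len(output) < 20:
        return False
    # Drop obvious refusals.
    if output.strip().lower().startswith(("lo siento", "i'm sorry")):
        return False
    return True

clean = ds.filter(keep)
print(f"kept {len(clean)} of {len(ds)} examples")
```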

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/