Hi there Mas @cahya, thanks for initiating this thread!
I’m Wilson Wongso (@w11wo on GitHub and HF), a second-year Computer Science undergraduate from Jakarta, Indonesia. I’m still relatively new to NLP and Hugging Face, but I’m having fun so far!
I recently trained some small language models with GPT-2 and RoBERTa as a side project during the semester break, and I’m interested in creating language models for native Indonesian languages like Javanese, Sundanese, Medanese, etc.
Looking forward to connecting with the community here!
Hi, my name is Cahya. I work as a system and software engineer in Vienna, Austria. My interest in ML / NLP started in early 2017 with a simple text classification task in TensorFlow.
Currently I like to experiment with Conversational AI, Open Domain Question Answering, and Text Summarization. I have built some Indonesian language models, which are hosted here, and helped add some existing Indonesian NLP datasets to the HF Datasets collection.
I hope we can connect and work together on interesting Indonesian NLP projects. One project I would like to try is creating an mBART model covering some of the languages spoken in Indonesia (at least the 15 most used ones), such as Javanese, Sundanese, or Minangkabau. This could later be used for machine translation among these languages or for other seq2seq tasks.
Hello guys, my name is Warto, from IAIN Purwokerto, Indonesia. My interests are NLP and text mining. I am a doctoral student at Dian Nuswantoro University, Semarang, and my research topic is information extraction.
I just finished annotating Indonesian news on the COVID-19 topic.
Hi, my name is Akmal. My Hugging Face and GitHub username is Wikidepia.
I have zero background in NLP / machine learning.
Currently I am interested in creating Indonesian transformer models like T5 and GPT-2, thanks to TFRC. I am also translating English datasets like PAWS.
@cahya thanks for sharing your models. Which variant do you reckon would be the best fit (size, inference speed, classification accuracy) if I were to fine-tune it to classify address strings in Indonesian? I am currently experimenting with cahya/bert-base-indonesian-522M to solve an NER problem.
Hi @yptheangel
Glad that you want to use my models. For classification accuracy, I would suggest the cahya/bert-base-indonesian-1.5G model, since it was trained on more data. If you want a smaller model with faster inference, I would suggest cahya/distilbert-base-indonesian, which used cahya/bert-base-indonesian-1.5G as the teacher.
I have also fine-tuned this BERT model for NER: cahya/bert-base-indonesian-NER, which used the NER dataset id_nergrit_corpus. However, I still need to write the model card/documentation for it.
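For anyone who wants to try the NER model mentioned above, here is a minimal sketch. It assumes the `transformers` library is installed and that the model ID `cahya/bert-base-indonesian-NER` resolves on the Hub as written in this post. The helper names `load_ner` and `summarize_entities` and the `min_score` threshold are hypothetical, made up for illustration, and the label set depends on the model's own config.

```python
def load_ner(model_id="cahya/bert-base-indonesian-NER"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the pure helper below works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def summarize_entities(entities, min_score=0.5):
    """Keep (text, label) pairs whose confidence is at least min_score.

    `entities` is the list of dicts returned by an aggregated
    token-classification pipeline (keys: word, entity_group, score, ...).
    The 0.5 default is an arbitrary illustrative threshold.
    """
    return [
        (entity["word"], entity["entity_group"])
        for entity in entities
        if entity["score"] >= min_score
    ]


# Example usage (requires network access on the first call):
#   ner = load_ner()
#   print(summarize_entities(ner("Joko Widodo lahir di Surakarta.")))
```

For address strings, the same pipeline call should work with a model fine-tuned on address labels; only the model ID would change.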
Hi Mas Akmal, nice to see you here too. Great that you have also built several Indonesian models, and I really appreciate that you created/translated several datasets for Indonesian NLP. Looking at your models and the datasets you created, I am not sure you really have zero background in NLP/machine learning.
Btw, how much longer do you have access to TFRC?
Hi, my name is Reza, and I’m new to NLP, especially to using Hugging Face.
I have a question: is it okay to train a language model (like BERT) on text with many misspelled words (Twitter-like sentences)?
We want to build an LM that can be used for many tasks, but we need inference to be fast enough (<500 ms).
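One way to check a latency target like the <500 ms above is to time the prediction call directly. The following is a minimal sketch using only the standard library. The function name `measure_latency_ms`, the `warmup` and `runs` defaults, and the stand-in `dummy_predict` are all hypothetical; in practice `predict` would wrap whatever model is being evaluated.

```python
import statistics
import time


def measure_latency_ms(predict, inputs, warmup=3, runs=20):
    """Time predict(text) per call and report median and p95 in milliseconds."""
    # Warm-up calls are not timed (first calls often include one-off setup cost).
    for text in inputs[:warmup]:
        predict(text)

    timings = []
    for _ in range(runs):
        for text in inputs:
            start = time.perf_counter()
            predict(text)
            timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = int(0.95 * (len(timings) - 1))
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


def dummy_predict(text):
    """Stand-in for a real model call; it only counts whitespace tokens."""
    return len(text.split())


stats = measure_latency_ms(dummy_predict, ["gmn kabar nya hari ini", "mantap bgt"])
print(f"median: {stats['median_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
print("within 500 ms budget:", stats["p95_ms"] < 500)
```

Checking the p95 rather than the average is a design choice here: a budget like this is usually broken by the slow tail (long tweets, cold caches), not by the typical request.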
Halo teman2, my name is Rapha (GitHub: github.com/raphaelmerx/, Twitter: https://twitter.com/RaphaelMerx/). I’m working on tetun.org, an online translator for the Tetun language (which is the main language spoken in Timor-Leste). Currently tetun.org supports Tetun-English only, but I’d like to add Tetun-Indonesian, since this is a frequent request from the app’s users.
Like Indonesian, Tetun is an Austronesian language, and so I’m very interested in Indonesian NLP because it’s a “sibling” language to Tetun with a lot more resources.
I speak Indonesian, and would love to get involved in some pure-Indonesian NLP projects, time permitting! @cahya I really like your idea of creating an mBART model for the languages of Indonesia. Did you get a chance to try it?
Hi Raphael, sorry for the late response. Yes, it would be nice to have Tetun-Indonesian; the challenge is the parallel corpus.
Glad you like the idea of creating an mBART model for some Indonesian languages. Unfortunately, I haven’t started on it yet. Maybe we should collaborate on it.
Btw, we have a Telegram channel for discussing Indonesian NLP and Hugging Face. If you like, I can send you an invitation.
Hi Cahya, you’re right about the parallel corpus. I tried to train a multilingual translation model (Tetun-English-Indonesian), but the Tetun-Indonesian quality was poor for this reason.
Yes I would be happy to join the Telegram channel! Thanks in advance for the invite.
Hi everyone and Pak @cahya!
I only recently found this discussion and am so excited to see so many like-minded people here!
I’m Ivo (@ivokun on GitHub, HF, and other platforms). I work as an associate research engineer. I started learning ML (NLP in particular) in 2017, when my previous company tried to develop article generation (and failed miserably). I then took a graduate degree with NLG as my main research topic.
After a year away from NLP, I have recently been able to resume my NLG research. I tried to train GPT3-XL (with GPT-Neo’s script) on the Bahasa Indonesia subset of the OSCAR dataset. I just got TRC access and want to make full use of it this week.
Hi @ivokun, nice to see another person interested in text generation for Indonesian.
Btw, did you mean gpt2-xl? We have experience pretraining gpt2-large on a 68GB dataset; it took around 6 days for 1 epoch on a TPU v3-8. If you want to train gpt2-xl, which is about twice the size of gpt2-large, on the Indonesian OSCAR dataset (around 30GB), you would also need around 6 days per epoch. I am just not sure whether you would have enough RAM to train gpt2-xl on a TPU v3-8.
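The estimate above can be reproduced with a back-of-the-envelope calculation. This is only a sketch under a loud assumption: that time per epoch scales linearly with both model size and dataset size on the same hardware. That is a simplification, not a measured result, and it ignores memory limits entirely. The function name `estimate_epoch_days` is hypothetical; the numbers are the ones quoted in the post.

```python
def estimate_epoch_days(ref_days, ref_data_gb, size_ratio, target_data_gb):
    """Scale a reference epoch time by model-size and dataset-size ratios.

    ASSUMPTION: training time is proportional to (model size) x (data size),
    with unchanged hardware and throughput efficiency.
    """
    if ref_data_gb <= 0:
        raise ValueError("ref_data_gb must be positive")
    return ref_days * size_ratio * (target_data_gb / ref_data_gb)


# Reference run from the post: gpt2-large, 68GB, about 6 days per epoch on TPU v3-8.
# Target: gpt2-xl (about 2x the size) on about 30GB of Indonesian OSCAR.
days = estimate_epoch_days(
    ref_days=6,
    ref_data_gb=68,
    size_ratio=2.0,
    target_data_gb=30,
)
print(f"Estimated time per epoch: {days:.1f} days")  # prints 5.3 days
```

Under these assumptions the result is in the same range as the "around 6 days" figure; whether gpt2-xl actually fits in memory on a TPU v3-8 is a separate question this sketch does not answer.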