--- language: - fr - en tags: - Question generation datasets: - FQuAD - PIAF - SQuAD metrics: - rouge - BERTScore (cammenbert) widget: - text: "The period from the death of Alexander the Great in 323 until the death of Cleopatra VII, the last Macedonian ruler of Egypt, is known as the Hellenistic period. In the early part of this period, a new form of kingship developed based on Macedonian and Near Eastern traditions. The first Hellenistic kings were previously Alexander's generals, and took power in the period following his death, though they were not part of existing royal lineages and lacked historic claims to the territories they controlled." --- # A Model for multiple questions generation for french and english languages ## Training The model has been trained on different french and english corpus (FQuAD, PIAF and SQuAD) where for each paragraph the objective is to predict all the possible questions. We are using the mbart model `facebook/mbart-large-50-many-to-many-mmt`, notice that it would works better with MBarthez like models for the french generation (MBarthez finetuning will be uploaded later). For all dataset we translate queries (in french or in english) but we always preserve paragraph in its original language, thus the model can create english questions with french paragraph. We trained the model during 12 epochs considering an epochs as 2000 batches of 128 (considering gradient accumulation). ## Generate ```python from transformers import AutoTokenizer, AutoModelForSeq2SeqLM access_token = "hf_......" # Loading the model weights and the tokenizer tokenizer = AutoTokenizer.from_pretrained("ThomasGerald/mbart-multi-question-generation", use_auth_token=access_token) model = AutoModelForSeq2SeqLM.from_pretrained("ThomasGerald/mbart-multi-question-generation", use_auth_token=access_token) # For the exemple we give the following text talking about origin of the grec language text = ("La recherche moderne considère généralement que la langue grecque n'est pas née en Grèce," + "mais elle n'est pas arrivée à un consensus quant à la date d'arrivée des groupes parlant un "+ "« proto-grec », qui s'est produite durant des phases préhistoriques pour lesquelles il n'y a"+ "pas de texte indiquant quelles langues étaient parlées. Les premiers textes écrits en grec sont"+ "les tablettes en linéaire B de l'époque mycénienne, au XIVe siècle av. J.-C., ce qui indique que"+ "des personnes parlant un dialecte grec sont présentes en Grèce au plus tard durant cette période."+ " La linguistique n'est pas en mesure de trancher, pas plus que l'archéologie.") # We specify the input languages and tokenize tokenizer.set_src_lang_special_tokens("fr_XX") tokenized_text = tokenizer([text], return_tensors="pt") ########## FRENCH DECODING # We generate given the output language code output = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.lang_code_to_id['fr_XX']) # We show the outpu of the model (sequence of questions separated by the token ) tokenizer.batch_decode(output, skip_special_tokens=False) #### output : '''["fr_XX Quelle est l'origine de la langue grecque selon la recherche moderne? Quelle est la date d'arrivée des grecs proto-grecs? Où sont les premiers textes écrits en grec? De quand date l'époque mycénienne? Qu'est ce que la linguistique n'est pas en mesure de faire?"] ''' ######### ENGLISH DECODING # We generate given the output language code output = model.generate(**tokenized_text, forced_bos_token_id=tokenizer.lang_code_to_id['en_XX']) # We show the outpu of the model (sequence of questions separated by the token ) tokenizer.batch_decode(output, skip_special_tokens=False) #### output : '''["en_XX What is the origin of the Greek language according to modern research? When did the prehistoric phases take place? What are the first texts written in Greek? When do the first written texts in Greek date back? What is the only science that can't decide the dates of the first Greek texts?"] ''' ```