Model Description
This repository hosts Llama-3-deidentifier, a model fine-tuned from Llama-3.1-8B-Instruct for the task of de-identifying personal information in multi-turn conversations. It replaces sensitive data such as names, contact details, addresses, and account numbers with placeholder tags, supporting privacy and compliance with data-protection regulations.
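The placeholder scheme the model is trained to emit ([PERSON1], [CONTACT1], and so on, with the same tag reused whenever a value repeats) can be illustrated with a tiny rule-based sketch. This is not the model itself — just a hypothetical regex pass over one easy PII type (Korean mobile numbers) to show the target output format:

```python
import re

def mask_simple_pii(text: str) -> str:
    """Toy illustration of the tagging scheme: replace Korean-style mobile
    numbers with [CONTACT*] placeholders, reusing one tag per unique value."""
    tags = {}  # maps each distinct phone number to its placeholder

    def repl(match):
        value = match.group(0)
        if value not in tags:
            tags[value] = f"[CONTACT{len(tags) + 1}]"
        return tags[value]

    return re.sub(r"01\d-\d{3,4}-\d{4}", repl, text)

print(mask_simple_pii("연락처는 010-1234-5678, 다시 010-1234-5678입니다."))
# → 연락처는 [CONTACT1], 다시 [CONTACT1]입니다.
```

A fixed regex like this only covers rigid formats; names, addresses, and free-form contact details require the language understanding the fine-tuned model provides.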
Training Data
The model was fine-tuned on the 'irene93/deidentify-chat-ko' dataset, which consists of Korean chat messages that have been anonymized to remove personally identifiable information (PII). This makes it well suited for training models to handle and protect sensitive information in text. More details on the dataset can be found in its Hugging Face repository.
How to use
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("irene93/Llama-3-deidentifier")
tokenizer = AutoTokenizer.from_pretrained("irene93/Llama-3-deidentifier")

# Sample multi-turn customer-service conversation (Korean). It contains a
# customer name, an order number, a home address, a phone number, and a
# bank-account number, all of which the model should replace with tags.
test_converse = '''고객: 안녕하세요, 제 이름은 김지수입니다. 얼마 전에 주문한 제품이 아직 도착하지 않았습니다. 제 주문번호는 SH12345이고, 저희 집 주소는 서울시 강남구 테헤란로 123-45입니다. 확인 부탁드립니다.
상담원: 안녕하세요, 김지수 고객님. 저희 쇼핑몰을 이용해 주셔서 감사합니다. 고객님의 주문 상황을 확인하고 있습니다. 잠시만 기다려 주시겠어요?
상담원: 확인 결과, 고객님의 주문은 택배사의 물류 지연으로 인해 배송이 지연되고 있습니다. 통상적으로 2~3일 내에는 배송될 예정입니다. 불편을 드려 죄송합니다.
고객: 지연 사유를 알려주셔서 감사합니다. 배송 예정일을 알 수 있을까요?
상담원: 네, 현재로서는 목요일까지는 도착할 것으로 예상됩니다. 도착하면 고객님께 연락드리겠습니다. 연락 받으실 전화번호가 010-1234-5678이 맞으신가요?
고객: 네, 맞습니다. 그 전화번호로 알려주세요.
상담원: 확인했습니다. 만약 배송 관련 추가 문의사항이 있으시면 언제든지 연락주세요. 혹시 환불이 필요하시다면 연결된 계좌로 환불 처리해 드릴 수 있습니다. 계좌번호는 하나은행 123-456-789012로 확인됩니다.
고객: 네, 감사합니다. 그럼 기다리겠습니다.
상담원: 감사합니다, 김지수 고객님. 불편을 드려 다시 한번 사과드립니다. 좋은 하루 보내세요!'''

# System prompt (Korean). It tells the model: you are a robot that masks
# personal information; (1) replace person names with [PERSON1], [PERSON2], ...
# in order of first appearance, reusing the same tag when a name repeats;
# (2) replace contacts, emails, addresses, and account numbers with
# [CONTACT*], [EMAIL*], [ADDRESS*], and [ACCOUNT*] tags, likewise reused for
# repeated values; (3) write plain continuous text, no bullets or numbering;
# (4) apply the rules consistently across the whole conversation.
messages = [
    {"role": "system", "content": "당신은 개인정보를 가려주는 로봇입니다.\n\n## 지시 사항 ##\n1.주어진 대화에서 사람 이름을 [PERSON1], [PERSON2] 등으로 등장 순서에 따라 대체하고, 동일한 이름이 반복될 경우 같은 대치어를 사용합니다.\n2.연락처, 이메일, 주소, 계좌번호는 각각 [CONTACT1], [CONTACT2] 등, [EMAIL1], [EMAIL2] 등, [ADDRESS1], [ADDRESS2] 등, [ACCOUNT1], [ACCOUNT2] 등으로 대치하고 동일한 정보가 반복되는 경우에는 같은 대치어를 사용합니다.\n3.대치어를 작성할 때 글머리 기호나 넘버링 방식을 쓰지 말고 평문으로 이어서 쓰십시오.\n4.위 규칙을 대화 전체에 걸쳐 일관되게 적용합니다.\n당신이 개인정보를 가릴 대화 내용입니다."},
    {"role": "user", "content": f"입력: {test_converse}"},  # "Input: <conversation>"
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding, for reproducible redaction
)

# Decode only the newly generated tokens: the de-identified conversation.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
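Because the output is free-form generated text, a light sanity check on the tags can catch inconsistent redactions. The helper below is a hypothetical post-processing step, not part of the model itself: it assumes the [PERSON*]/[CONTACT*]/[EMAIL*]/[ADDRESS*]/[ACCOUNT*] format described above and verifies that each category is numbered contiguously from 1:

```python
import re

def check_tags(deidentified: str) -> dict:
    """Collect placeholder tags per category and verify each category is
    numbered 1..N without gaps; returns the tag count per category."""
    found = {}
    for cat, num in re.findall(r"\[(PERSON|CONTACT|EMAIL|ADDRESS|ACCOUNT)(\d+)\]", deidentified):
        found.setdefault(cat, set()).add(int(num))
    for cat, nums in found.items():
        # e.g. seeing [PERSON2] without [PERSON1] indicates an inconsistent redaction
        assert nums == set(range(1, len(nums) + 1)), f"non-contiguous {cat} tags: {sorted(nums)}"
    return {cat: len(nums) for cat, nums in found.items()}

sample = "[PERSON1]: 제 번호는 [CONTACT1], 계좌는 [ACCOUNT1]입니다. [PERSON1] 고객님 감사합니다."
print(check_tags(sample))  # → {'PERSON': 1, 'CONTACT': 1, 'ACCOUNT': 1}
```

For the conversation above, a reasonable check would also confirm that the raw phone and account numbers no longer appear anywhere in the output.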
Model tree for irene93/Llama-3-deidentifier
- Base model: meta-llama/Llama-3.1-8B
- Finetuned from: meta-llama/Llama-3.1-8B-Instruct