Model Description

This repository hosts a 'Llama-3-8B-Instruct' model fine-tuned for the specific task of de-identifying personal information in multi-turn conversations. It replaces sensitive data such as names, contact information, addresses, and account numbers with consistent placeholder tags (e.g. [PERSON1], [CONTACT1]), improving privacy and easing compliance with data protection regulations.
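The placeholder convention the model is prompted to follow (order-of-appearance numbering, with repeated values reusing the same tag) can be sketched with a toy substitution function. This is only an illustration of the output format, not the model itself; the helper name and the phone-number-only scope are hypothetical simplifications:

```python
import re

def toy_deidentify(text: str) -> str:
    """Toy sketch of the tagging scheme: each distinct phone number becomes
    [CONTACT1], [CONTACT2], ... in order of appearance, and a repeated value
    reuses its earlier placeholder. The real model applies the same idea to
    names, emails, addresses, and account numbers as well."""
    mapping = {}

    def repl(match):
        value = match.group(0)
        if value not in mapping:
            mapping[value] = f"[CONTACT{len(mapping) + 1}]"
        return mapping[value]

    # Simple Korean mobile-number pattern (e.g. 010-1234-5678)
    return re.sub(r"\b01\d-\d{3,4}-\d{4}\b", repl, text)

print(toy_deidentify("Call 010-1234-5678 or 010-1234-5678, not 010-9999-0000."))
# -> Call [CONTACT1] or [CONTACT1], not [CONTACT2].
```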

Training Data

The model was fine-tuned on the 'irene93/deidentify-chat-ko' dataset, which contains Korean chat messages that have been anonymized to remove personally identifiable information (PII). This makes it well suited for training models to handle and protect sensitive information in text. More details can be found on the dataset's Hugging Face repository.

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("irene93/Llama-3-deidentifier")
tokenizer = AutoTokenizer.from_pretrained("irene93/Llama-3-deidentifier")

test_converse = '''고객: μ•ˆλ…•ν•˜μ„Έμš”, 제 이름은 κΉ€μ§€μˆ˜μž…λ‹ˆλ‹€. μ–Όλ§ˆ 전에 μ£Όλ¬Έν•œ μ œν’ˆμ΄ 아직 λ„μ°©ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 제 μ£Όλ¬Έλ²ˆν˜ΈλŠ” SH12345이고, 저희 집 μ£Όμ†ŒλŠ” μ„œμšΈμ‹œ 강남ꡬ ν…Œν—€λž€λ‘œ 123-45μž…λ‹ˆλ‹€. 확인 λΆ€νƒλ“œλ¦½λ‹ˆλ‹€.

상담원: μ•ˆλ…•ν•˜μ„Έμš”, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. 저희 μ‡Όν•‘λͺ°μ„ μ΄μš©ν•΄ μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. κ³ κ°λ‹˜μ˜ μ£Όλ¬Έ 상황을 ν™•μΈν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. μž μ‹œλ§Œ κΈ°λ‹€λ € μ£Όμ‹œκ² μ–΄μš”?

상담원: 확인 κ²°κ³Ό, κ³ κ°λ‹˜μ˜ 주문은 νƒλ°°μ‚¬μ˜ λ¬Όλ₯˜ μ§€μ—°μœΌλ‘œ 인해 배솑이 μ§€μ—°λ˜κ³  μžˆμŠ΅λ‹ˆλ‹€. ν†΅μƒμ μœΌλ‘œ 2~3일 λ‚΄μ—λŠ” 배솑될 μ˜ˆμ •μž…λ‹ˆλ‹€. λΆˆνŽΈμ„ λ“œλ € μ£„μ†‘ν•©λ‹ˆλ‹€.

고객: 지연 μ‚¬μœ λ₯Ό μ•Œλ €μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. 배솑 μ˜ˆμ •μΌμ„ μ•Œ 수 μžˆμ„κΉŒμš”?

상담원: λ„€, ν˜„μž¬λ‘œμ„œλŠ” λͺ©μš”μΌκΉŒμ§€λŠ” 도착할 κ²ƒμœΌλ‘œ μ˜ˆμƒλ©λ‹ˆλ‹€. λ„μ°©ν•˜λ©΄ κ³ κ°λ‹˜κ»˜ μ—°λ½λ“œλ¦¬κ² μŠ΅λ‹ˆλ‹€. 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜Έκ°€ 010-1234-5678이 λ§žμœΌμ‹ κ°€μš”?

고객: λ„€, λ§žμŠ΅λ‹ˆλ‹€. κ·Έ μ „ν™”λ²ˆν˜Έλ‘œ μ•Œλ €μ£Όμ„Έμš”.

상담원: ν™•μΈν–ˆμŠ΅λ‹ˆλ‹€. λ§Œμ•½ 배솑 κ΄€λ ¨ μΆ”κ°€ λ¬Έμ˜μ‚¬ν•­μ΄ μžˆμœΌμ‹œλ©΄ μ–Έμ œλ“ μ§€ μ—°λ½μ£Όμ„Έμš”. ν˜Ήμ‹œ ν™˜λΆˆμ΄ ν•„μš”ν•˜μ‹œλ‹€λ©΄ μ—°κ²°λœ κ³„μ’Œλ‘œ ν™˜λΆˆ μ²˜λ¦¬ν•΄ λ“œλ¦΄ 수 μžˆμŠ΅λ‹ˆλ‹€. κ³„μ’Œλ²ˆν˜ΈλŠ” ν•˜λ‚˜μ€ν–‰ 123-456-789012둜 ν™•μΈλ©λ‹ˆλ‹€.

고객: λ„€, κ°μ‚¬ν•©λ‹ˆλ‹€. 그럼 κΈ°λ‹€λ¦¬κ² μŠ΅λ‹ˆλ‹€.

상담원: κ°μ‚¬ν•©λ‹ˆλ‹€, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. λΆˆνŽΈμ„ λ“œλ € λ‹€μ‹œ ν•œλ²ˆ μ‚¬κ³Όλ“œλ¦½λ‹ˆλ‹€. 쒋은 ν•˜λ£¨ λ³΄λ‚΄μ„Έμš”!'''

messages = [
    {"role": "system", "content": "당신은 κ°œμΈμ •λ³΄λ₯Ό κ°μΆ°μ£ΌλŠ” λ‘œλ΄‡μž…λ‹ˆλ‹€.\n\n## μ§€μ‹œ 사항 ##\n1.주어진 λŒ€ν™”μ—μ„œ μ‚¬λžŒμ΄λ¦„μ„ [PERSON1], [PERSON2] λ“±μœΌλ‘œ λ“±μž₯ μˆœμ„œμ— 따라 λŒ€μ²΄ν•˜κ³ , λ™μΌν•œ 이름이 반볡될 경우 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n2.μ—°λ½μ²˜, 이메일, μ£Όμ†Œ , κ³„μ’Œλ²ˆν˜Έλ„ 각각 [CONTACT1], [CONTACT2] λ“±, [EMAIL1],[EMAIL2] λ“±, [ADDRESS1],[ADDRESS2]λ“± , [ACCOUNT1], [ACCOUNT2] λ“± 으둜 λŒ€μΉ˜ν•˜κ³  λ™μΌν•œ 정보가 λ°˜λ³΅λ˜λŠ” κ²½μš°μ—λŠ” 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n3.λŒ€μΉ˜μ–΄λ₯Ό μž‘μ„±ν• λ•Œ 글머리 κΈ°ν˜Έλ‚˜, λ‚˜μ—΄μ‹ 방식을 쓰지말고 ν‰λ¬ΈμœΌλ‘œ μ΄μ–΄μ„œ μ“°μ‹­μ‹œμ˜€ \n4.μœ„ κ·œμΉ™μ€ λŒ€ν™” 전체에 걸쳐 μΌκ΄€λ˜κ²Œ μ μš©ν•©λ‹ˆλ‹€. \n당신이 κ°œμΈμ •λ³΄λ₯Ό 감좜 λŒ€ν™”λ‚΄μ—­μž…λ‹ˆλ‹€."},
    {"role": "user", "content": f"μž…λ ₯: {test_converse}"}
]

# Build the prompt with the Llama 3 chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding for deterministic de-identification
)

# Decode only the newly generated tokens, skipping the prompt
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
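After generation, it can be useful to sanity-check that the raw PII strings from the input no longer appear in the response and to see which placeholder tags were used. A minimal sketch (the `response_text` value below is a hand-written example in the model's tag format, not actual model output):

```python
import re

# Hand-written example in the model's tag format (not actual model output)
response_text = "상담원: μ•ˆλ…•ν•˜μ„Έμš”, [PERSON1] κ³ κ°λ‹˜. 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜ΈλŠ” [CONTACT1]이 λ§žμœΌμ‹ κ°€μš”?"

# PII values from the original conversation that should be gone
raw_pii = ["κΉ€μ§€μˆ˜", "010-1234-5678", "SH12345", "123-456-789012"]
leaked = [value for value in raw_pii if value in response_text]
print("leaked PII:", leaked)

# Placeholder tags actually used in the output
tags = sorted(set(re.findall(r"\[(?:PERSON|CONTACT|EMAIL|ADDRESS|ACCOUNT)\d+\]", response_text)))
print("tags used:", tags)
```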