Model Description

This repository hosts a 'Llama-3-8B-Instruct' model fine-tuned for the specific task of de-identifying personal information in multi-turn conversations. It replaces sensitive data such as names, contact information, addresses, and account numbers with consistent placeholder tags (e.g. [PERSON1], [CONTACT1]), improving privacy and easing compliance with data protection regulations.
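The placeholder convention the model is prompted to follow (order-of-appearance numbering, with repeated values reusing the same tag) can be sketched with a toy substitution function. This is only an illustration of the output format, not the model itself; the helper name and the phone-number-only scope are hypothetical simplifications:

```python
import re

def toy_deidentify(text: str) -> str:
    """Toy sketch of the tagging scheme: each distinct phone number becomes
    [CONTACT1], [CONTACT2], ... in order of appearance, and a repeated value
    reuses its earlier placeholder. The real model applies the same idea to
    names, emails, addresses, and account numbers as well."""
    mapping = {}

    def repl(match):
        value = match.group(0)
        if value not in mapping:
            mapping[value] = f"[CONTACT{len(mapping) + 1}]"
        return mapping[value]

    # Simple Korean mobile-number pattern (e.g. 010-1234-5678)
    return re.sub(r"\b01\d-\d{3,4}-\d{4}\b", repl, text)

print(toy_deidentify("Call 010-1234-5678 or 010-1234-5678, not 010-9999-0000."))
# -> Call [CONTACT1] or [CONTACT1], not [CONTACT2].
```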

Training Data

The model was fine-tuned on the 'irene93/deidentify-chat-ko' dataset, which contains Korean chat messages that have been anonymized to remove personally identifiable information (PII). This makes it well suited for training models to handle and protect sensitive information in text. More details can be found on the dataset's Hugging Face repository.

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("irene93/Llama-3-deidentifier")
tokenizer = AutoTokenizer.from_pretrained("irene93/Llama-3-deidentifier")

test_converse = '''고객: μ•ˆλ…•ν•˜μ„Έμš”, 제 이름은 κΉ€μ§€μˆ˜μž…λ‹ˆλ‹€. μ–Όλ§ˆ 전에 μ£Όλ¬Έν•œ μ œν’ˆμ΄ 아직 λ„μ°©ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 제 μ£Όλ¬Έλ²ˆν˜ΈλŠ” SH12345이고, 저희 집 μ£Όμ†ŒλŠ” μ„œμšΈμ‹œ 강남ꡬ ν…Œν—€λž€λ‘œ 123-45μž…λ‹ˆλ‹€. 확인 λΆ€νƒλ“œλ¦½λ‹ˆλ‹€.

상담원: μ•ˆλ…•ν•˜μ„Έμš”, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. 저희 μ‡Όν•‘λͺ°μ„ μ΄μš©ν•΄ μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. κ³ κ°λ‹˜μ˜ μ£Όλ¬Έ 상황을 ν™•μΈν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. μž μ‹œλ§Œ κΈ°λ‹€λ € μ£Όμ‹œκ² μ–΄μš”?

상담원: 확인 κ²°κ³Ό, κ³ κ°λ‹˜μ˜ 주문은 νƒλ°°μ‚¬μ˜ λ¬Όλ₯˜ μ§€μ—°μœΌλ‘œ 인해 배솑이 μ§€μ—°λ˜κ³  μžˆμŠ΅λ‹ˆλ‹€. ν†΅μƒμ μœΌλ‘œ 2~3일 λ‚΄μ—λŠ” 배솑될 μ˜ˆμ •μž…λ‹ˆλ‹€. λΆˆνŽΈμ„ λ“œλ € μ£„μ†‘ν•©λ‹ˆλ‹€.

고객: 지연 μ‚¬μœ λ₯Ό μ•Œλ €μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. 배솑 μ˜ˆμ •μΌμ„ μ•Œ 수 μžˆμ„κΉŒμš”?

상담원: λ„€, ν˜„μž¬λ‘œμ„œλŠ” λͺ©μš”μΌκΉŒμ§€λŠ” 도착할 κ²ƒμœΌλ‘œ μ˜ˆμƒλ©λ‹ˆλ‹€. λ„μ°©ν•˜λ©΄ κ³ κ°λ‹˜κ»˜ μ—°λ½λ“œλ¦¬κ² μŠ΅λ‹ˆλ‹€. 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜Έκ°€ 010-1234-5678이 λ§žμœΌμ‹ κ°€μš”?

고객: λ„€, λ§žμŠ΅λ‹ˆλ‹€. κ·Έ μ „ν™”λ²ˆν˜Έλ‘œ μ•Œλ €μ£Όμ„Έμš”.

상담원: ν™•μΈν–ˆμŠ΅λ‹ˆλ‹€. λ§Œμ•½ 배솑 κ΄€λ ¨ μΆ”κ°€ λ¬Έμ˜μ‚¬ν•­μ΄ μžˆμœΌμ‹œλ©΄ μ–Έμ œλ“ μ§€ μ—°λ½μ£Όμ„Έμš”. ν˜Ήμ‹œ ν™˜λΆˆμ΄ ν•„μš”ν•˜μ‹œλ‹€λ©΄ μ—°κ²°λœ κ³„μ’Œλ‘œ ν™˜λΆˆ μ²˜λ¦¬ν•΄ λ“œλ¦΄ 수 μžˆμŠ΅λ‹ˆλ‹€. κ³„μ’Œλ²ˆν˜ΈλŠ” ν•˜λ‚˜μ€ν–‰ 123-456-789012둜 ν™•μΈλ©λ‹ˆλ‹€.

고객: λ„€, κ°μ‚¬ν•©λ‹ˆλ‹€. 그럼 κΈ°λ‹€λ¦¬κ² μŠ΅λ‹ˆλ‹€.

상담원: κ°μ‚¬ν•©λ‹ˆλ‹€, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. λΆˆνŽΈμ„ λ“œλ € λ‹€μ‹œ ν•œλ²ˆ μ‚¬κ³Όλ“œλ¦½λ‹ˆλ‹€. 쒋은 ν•˜λ£¨ λ³΄λ‚΄μ„Έμš”!'''

messages = [
    {"role": "system", "content": "당신은 κ°œμΈμ •λ³΄λ₯Ό κ°μΆ°μ£ΌλŠ” λ‘œλ΄‡μž…λ‹ˆλ‹€.\n\n## μ§€μ‹œ 사항 ##\n1.주어진 λŒ€ν™”μ—μ„œ μ‚¬λžŒμ΄λ¦„μ„ [PERSON1], [PERSON2] λ“±μœΌλ‘œ λ“±μž₯ μˆœμ„œμ— 따라 λŒ€μ²΄ν•˜κ³ , λ™μΌν•œ 이름이 반볡될 경우 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n2.μ—°λ½μ²˜, 이메일, μ£Όμ†Œ , κ³„μ’Œλ²ˆν˜Έλ„ 각각 [CONTACT1], [CONTACT2] λ“±, [EMAIL1],[EMAIL2] λ“±, [ADDRESS1],[ADDRESS2]λ“± , [ACCOUNT1], [ACCOUNT2] λ“± 으둜 λŒ€μΉ˜ν•˜κ³  λ™μΌν•œ 정보가 λ°˜λ³΅λ˜λŠ” κ²½μš°μ—λŠ” 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n3.λŒ€μΉ˜μ–΄λ₯Ό μž‘μ„±ν• λ•Œ 글머리 κΈ°ν˜Έλ‚˜, λ‚˜μ—΄μ‹ 방식을 쓰지말고 ν‰λ¬ΈμœΌλ‘œ μ΄μ–΄μ„œ μ“°μ‹­μ‹œμ˜€ \n4.μœ„ κ·œμΉ™μ€ λŒ€ν™” 전체에 걸쳐 μΌκ΄€λ˜κ²Œ μ μš©ν•©λ‹ˆλ‹€. \n당신이 κ°œμΈμ •λ³΄λ₯Ό 감좜 λŒ€ν™”λ‚΄μ—­μž…λ‹ˆλ‹€."},
    {"role": "user", "content": f"μž…λ ₯: {test_converse}"}
]

# Build the prompt with the Llama 3 chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the tokenizer's EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding for deterministic de-identification
)

# Decode only the newly generated tokens, skipping the prompt
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
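After generation, it can be useful to sanity-check that the raw PII strings from the input no longer appear in the response and to see which placeholder tags were used. A minimal sketch (the `response_text` value below is a hand-written example in the model's tag format, not actual model output):

```python
import re

# Hand-written example in the model's tag format (not actual model output)
response_text = "상담원: μ•ˆλ…•ν•˜μ„Έμš”, [PERSON1] κ³ κ°λ‹˜. 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜ΈλŠ” [CONTACT1]이 λ§žμœΌμ‹ κ°€μš”?"

# PII values from the original conversation that should be gone
raw_pii = ["κΉ€μ§€μˆ˜", "010-1234-5678", "SH12345", "123-456-789012"]
leaked = [value for value in raw_pii if value in response_text]
print("leaked PII:", leaked)

# Placeholder tags actually used in the output
tags = sorted(set(re.findall(r"\[(?:PERSON|CONTACT|EMAIL|ADDRESS|ACCOUNT)\d+\]", response_text)))
print("tags used:", tags)
```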