Typosquat T5 detector

Model Details

Model Description

This model is an encoder-decoder model fine-tuned from flan-t5-large to detect typosquatting of domain names. It classifies whether a domain name is a typographical variant (typosquat) of another domain.

  • Developed by: Anvilogic
  • Model type: Encoder-Decoder
  • Maximum Sequence Length: 512 tokens
  • Language(s) (NLP): Multilingual
  • License: MIT
  • Finetuned from model: flan-t5-large

Usage

Direct Usage (Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.

The following code loads the model and tests it on a single pair of domains:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")
model = AutoModelForSeq2SeqLM.from_pretrained("Anvilogic/Flan-T5-typosquat-detect")

# Example input: the suspected typosquat comes first, the legitimate domain second
typosquat_candidate = "goog1e.com"
legitimate_domain = "google.com"

# Build the prompt and tokenize it into PyTorch tensors
input_text = f"Is the first domain a typosquat of the second: {typosquat_candidate} {legitimate_domain}"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the answer and decode it back to text
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

false
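
When scoring many pairs, it can help to keep the prompt construction and the output parsing in small helpers. The sketch below assumes the model replies with the literal strings "true" or "false", as the example output above suggests. `build_prompt` and `parse_label` are hypothetical helper names and not part of the model's API.

```python
PROMPT_TEMPLATE = "Is the first domain a typosquat of the second: {candidate} {legitimate}"


def build_prompt(candidate: str, legitimate: str) -> str:
    """Format a domain pair with the prompt wording used in the example above."""
    return PROMPT_TEMPLATE.format(
        candidate=candidate.strip(),
        legitimate=legitimate.strip(),
    )


def parse_label(decoded: str) -> bool:
    """Map the decoded model output to a boolean.

    Assumption: the model emits "true" or "false"; anything else raises
    so that unexpected generations are not silently treated as a label.
    """
    text = decoded.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"unexpected model output: {decoded!r}")


# Build the prompt for one pair; pass it to the tokenizer as shown above.
prompt = build_prompt("goog1e.com", "google.com")
print(prompt)
```

The string returned by `tokenizer.decode(...)` can then be passed to `parse_label` to obtain a boolean.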

Downstream Usage

This model can be combined with an embedding model to enhance typosquatting detection. First, the embedding model retrieves the most similar domains from a database of legitimate domains. Then, this encoder-decoder labels each candidate pair, confirming whether the domain is a typosquat and identifying which legitimate domain it imitates.

For embedding, consider using: Anvilogic/Embedder-typosquat-detect
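
The two-stage flow described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `difflib.SequenceMatcher` stands in for the embedding model so the sketch has no dependencies, and `classify` is a hypothetical callable that would wrap this encoder-decoder. The names `retrieve_similar`, `detect_typosquat` and `stub_classifier` are likewise invented for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def retrieve_similar(candidate: str, legitimate_domains: List[str], top_k: int = 3) -> List[str]:
    """Stage 1 stand-in: rank legitimate domains by string similarity.

    Assumption: a real deployment would rank by embedding similarity;
    SequenceMatcher is used here only to keep the sketch self-contained.
    """
    ranked = sorted(
        legitimate_domains,
        key=lambda domain: SequenceMatcher(None, candidate, domain).ratio(),
        reverse=True,
    )
    return ranked[:top_k]


def detect_typosquat(
    candidate: str,
    legitimate_domains: List[str],
    classify: Callable[[str, str], bool],
    top_k: int = 3,
) -> List[str]:
    """Stage 2: ask the classifier about each retrieved pair.

    Returns the legitimate domains the candidate is judged to imitate.
    """
    shortlist = retrieve_similar(candidate, legitimate_domains, top_k)
    return [domain for domain in shortlist if classify(candidate, domain)]


def stub_classifier(candidate: str, legitimate: str) -> bool:
    """Hypothetical placeholder; replace with a call to the encoder-decoder."""
    return candidate != legitimate


known_domains = ["google.com", "github.com", "paypal.com"]
print(detect_typosquat("goog1e.com", known_domains, stub_classifier, top_k=1))
```

Passing the classifier in as a callable keeps the retrieval step independent of how the model is loaded or served.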

Bias, Risks, and Limitations

Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

Training Details

Framework Versions

  • Python: 3.10.14
  • Transformers: 4.46.2
  • PyTorch: 2.2.2
  • Tokenizers: 0.20.3

Training Data

The model was fine-tuned using Anvilogic/T5-Typosquat-Training-Dataset, which contains pairs of domain names and the expected response.

Training Procedure

The model was optimized using the cross-entropy loss, CrossEntropyLoss(), computed on the logits of the output tokens.
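
For readers unfamiliar with the loss, the snippet below is a minimal pure-Python sketch of what cross-entropy computes for a single predicted token: the negative log-probability of the target under a softmax over the logits. The three-entry logits list is made up for illustration, and `cross_entropy` is a hypothetical helper, not the training code.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


# Toy logits over a three-token vocabulary (illustrative values only).
toy_logits = [4.0, 0.5, -1.0]
loss_if_correct = cross_entropy(toy_logits, target_index=0)
loss_if_wrong = cross_entropy(toy_logits, target_index=2)
print(loss_if_correct < loss_if_wrong)
```

The loss is small when the highest logit belongs to the target token and grows as probability mass moves to other tokens.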

Training Hyperparameters

  • Model Architecture: Encoder-Decoder fine-tuned from flan-t5-large
  • Batch Size: 8
  • Epochs: 5
  • Learning Rate: 5e-5

Evaluation

Training loss

Epoch    Training loss    Validation loss
1        0.0807           0.016496
2        0.0270           0.018645
3        0.0034           0.016577
4        0.0002           0.012842
5        0.0407           0.014530

We kept only the checkpoint from epoch 4, as it has the lowest training loss (0.0002) and the lowest validation loss (0.012842).
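
The selection rule can be checked directly against the reported numbers. The sketch below simply picks the epoch with the lowest validation loss from the table; the variable names are illustrative.

```python
# Validation loss per epoch, copied from the table above.
validation_loss = {
    1: 0.016496,
    2: 0.018645,
    3: 0.016577,
    4: 0.012842,
    5: 0.014530,
}

# Pick the epoch whose validation loss is smallest.
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch)  # 4
```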
