Commit 8c0274d by anvilogic-admin · verified · 1 Parent(s): 3c10d28

Update README.md

Files changed (1)
  1. README.md +12 -13
README.md CHANGED
@@ -12,24 +12,23 @@ pinned: false
  Welcome to the official Hugging Face organization for [Anvilogic's](https://www.anvilogic.com/) advanced cybersecurity AI models!
  Founded in 2019, [Anvilogic](https://www.anvilogic.com/) specializes in AI-driven threat detection and automation, enhancing Security Operations Center (SOC) capabilities with scalable, data-driven solutions.
 
- ## Typosquatting collection
- Typosquatting is a form of cyber attack where malicious actors create fake domain names that are visually or phonetically similar to legitimate domains, intending to deceive users into visiting these sites.
- This collection aims at detecting typosquatted domains by identifying and flagging such domains :
- It is comprised of the following:
+ ## Typosquatting Collection
+ Typosquatting is a form of cyber attack where malicious actors create fake domain names that are visually or phonetically similar to legitimate domains, intending to deceive users into visiting these sites. This collection aims to detect typosquatted domains by identifying and flagging them. It is comprised of the following:
 
  ### Models
 
- - **Embedders :** This model provides representation for domain names. This is used to mine similar domains. This model exists both based on RoBERTa model (with BPE tokenization) and CANINE-c (with character-level encoding)
- - **Cross-Encoders :** This model is able to compare two domain names and conclude if one domain is a typosquat of another. This model exists both based on RoBERTa model (with BPE tokenization) and CANINE-c (with character-level encoding)
- - **T5 :** This model is a derived version of T5 trained on a new task, with the prefix : "Is the first domain a typosquat of the second : " to which we append *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*
+ - **Embedder**: This model provides a representation for domain names and is used to mine similar domains. It is available in both a RoBERTa-based version (with BPE tokenization) and a CANINE-c version (with character-level encoding).
+ - **Cross-Encoder**: This model can compare two domain names and determine if one domain is a typosquat of another. It is available in both a RoBERTa-based version (with BPE tokenization) and a CANINE-c version (with character-level encoding).
+ - **T5 Typosquat Detection**: This model is a derived version of T5 trained on a new task, with the prefix "Is the first domain a typosquat of the second:" to which we append *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*.
 
  ### Datasets
 
- - **Embedder training dataset :** Dataset formatted to train embedding model with (Anchor,Positive) pairs
- - **Cross-Encoder training dataset :** Dataset formatted to train Cross-encoder model with (Anchor,Positive,label) samples.
- - **T5 training dataset :** Dataset formatted to train T5 model with (prompt,response) pairs .
+ - **Embedder Training Dataset**: A dataset formatted to train the embedding model, containing pairs of (Anchor,Positive) domain examples.
+ - **Cross-Encoder Training Dataset**: A dataset formatted to train the Cross-Encoder model with (Anchor,Positive,label) samples.
+ - **T5 Training Dataset**: A dataset formatted to train the T5 model with (prompt,response) pairs.
 
  ### Spaces
- - **Embedder Typosquat Detect :** Allows the user to retrieve most similar domains from a pool of 4000 most common domains.
- - **CE Typosquat Detect :** Allows the user to compare two domains using Cross-encoders.The model outputs of a probability of typosquatting.
- - **T5 Typosquat Detect :** Allows the user to compare two domains using T5. The model outputs a boolean.
+
+ - **Embedder Typosquat Detect**: Allows users to retrieve the most similar domains from a pool of 4,000 of the most common domains.
+ - **Cross-Encoder (CE) Typosquat Detect**: Allows users to compare two domains using the Cross-Encoder. The model outputs a probability of typosquatting.
+ - **T5 Typosquat Detect**: Allows users to compare two domains using the T5 model. The model outputs a boolean value indicating whether the domain is a typosquat.
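
Note on the T5 entry in the updated README: it says the task prefix is followed by *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*, but it does not say how the two domains are separated or normalized. The sketch below shows one way such a prompt might be assembled. The function name `build_t5_prompt`, the single-space separator, and the lowercasing step are all assumptions made for illustration, not the authors' documented format; check the model card before relying on them.

```python
# Illustrative sketch only. The prefix text comes from the README; the
# separator between the two domains and the lowercasing are assumptions.
PREFIX = "Is the first domain a typosquat of the second:"


def build_t5_prompt(typosquat_domain: str, legitimate_domain: str) -> str:
    """Return the task prefix followed by the two domains (hypothetical layout)."""
    # Domain names are case-insensitive, so normalizing case is a reasonable
    # (but assumed) preprocessing step.
    candidate = typosquat_domain.strip().lower()
    legitimate = legitimate_domain.strip().lower()
    return f"{PREFIX} {candidate} {legitimate}"


if __name__ == "__main__":
    print(build_t5_prompt("g00gle.com", "google.com"))
```

The resulting string would then be tokenized and passed to the fine-tuned T5 model, which, per the Spaces description, returns a boolean-style answer.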