Space: Anvilogic/README

anvilogic-admin committed on Nov 11, 2024

Commit 8c0274d (verified) · 1 parent: 3c10d28

Update README.md

README.md +12 -13

@@ -12,24 +12,23 @@ pinned: false
 Welcome to the official Hugging Face organization for [Anvilogic's](https://www.anvilogic.com/) advanced cybersecurity AI models!
 Founded in 2019, [Anvilogic](https://www.anvilogic.com/) specializes in AI-driven threat detection and automation, enhancing Security Operations Center (SOC) capabilities with scalable, data-driven solutions.
-## Typosquatting collection
-  Typosquatting is a form of cyber attack where malicious actors create fake domain names that are visually or phonetically similar to legitimate domains, intending to deceive users into visiting these sites.
-  This collection aims at detecting typosquatted domains by identifying and flagging such domains :
-  It is comprised of the following:
 ### Models
-- **Embedders :** This model provides representation for domain names. This is used to mine similar domains.  This model exists both based on RoBERTa model (with BPE tokenization) and CANINE-c (with character-level encoding)
-- **Cross-Encoders :** This model is able to compare two domain names and conclude if one domain is a typosquat of another.  This model exists both based on RoBERTa model (with BPE tokenization) and CANINE-c (with character-level encoding)
-- **T5 :** This model is a derived version of T5 trained on a new task, with the prefix : "Is the first domain a typosquat of the second : " to which we append *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*
 ### Datasets
-- **Embedder training dataset :** Dataset formatted to train embedding model with (Anchor,Positive) pairs
-- **Cross-Encoder training dataset :** Dataset formatted to train Cross-encoder model with (Anchor,Positive,label) samples.
-- **T5 training dataset :** Dataset formatted to train T5 model with (prompt,response) pairs .
 ### Spaces
-- **Embedder Typosquat Detect :** Allows the user to retrieve most similar domains from a pool of 4000 most common domains.
-- **CE Typosquat Detect :** Allows the user to compare two domains using Cross-encoders.The model outputs of a probability of typosquatting.
-- **T5 Typosquat Detect :** Allows the user to compare two domains using T5. The model outputs a boolean.

 Welcome to the official Hugging Face organization for [Anvilogic's](https://www.anvilogic.com/) advanced cybersecurity AI models!
 Founded in 2019, [Anvilogic](https://www.anvilogic.com/) specializes in AI-driven threat detection and automation, enhancing Security Operations Center (SOC) capabilities with scalable, data-driven solutions.
+## Typosquatting Collection
+Typosquatting is a form of cyber attack where malicious actors create fake domain names that are visually or phonetically similar to legitimate domains, intending to deceive users into visiting these sites. This collection aims to detect typosquatted domains by identifying and flagging them. It is comprised of the following:
 ### Models
+- **Embedder**: This model provides a representation for domain names and is used to mine similar domains. It is available in both a RoBERTa-based version (with BPE tokenization) and a CANINE-c version (with character-level encoding).
+- **Cross-Encoder**: This model can compare two domain names and determine if one domain is a typosquat of another. It is available in both a RoBERTa-based version (with BPE tokenization) and a CANINE-c version (with character-level encoding).
+- **T5 Typosquat Detection**: This model is a derived version of T5 trained on a new task, with the prefix "Is the first domain a typosquat of the second:" to which we append *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*.
 ### Datasets
+- **Embedder Training Dataset**: A dataset formatted to train the embedding model, containing pairs of (Anchor,Positive) domain examples.
+- **Cross-Encoder Training Dataset**: A dataset formatted to train the Cross-Encoder model with (Anchor,Positive,label) samples.
+- **T5 Training Dataset**: A dataset formatted to train the T5 model with (prompt,response) pairs.
 ### Spaces
+- **Embedder Typosquat Detect**: Allows users to retrieve the most similar domains from a pool of 4,000 of the most common domains.
+- **Cross-Encoder (CE) Typosquat Detect**: Allows users to compare two domains using the Cross-Encoder. The model outputs a probability of typosquatting.
+- **T5 Typosquat Detect**: Allows users to compare two domains using the T5 model. The model outputs a boolean value indicating whether the domain is a typosquat.
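
The T5 prompt format described in the updated README (a fixed prefix followed by the candidate typosquat domain and then the legitimate domain) can be sketched as below. This is only an illustration: the README does not state how the two domains are separated after the prefix, so the single-space separator, the function name `build_t5_prompt`, and the whitespace stripping are assumptions and not the authors' confirmed preprocessing.

```python
# Prefix quoted from the README; everything else here is a hypothetical sketch.
PREFIX = "Is the first domain a typosquat of the second:"


def build_t5_prompt(typosquat_domain: str, legitimate_domain: str, sep: str = " ") -> str:
    """Build a prompt string: prefix, then candidate domain, then legitimate domain.

    `sep` is an assumed separator between the two domains; the README does not
    specify one, so check the model card before relying on it.
    """
    candidate = typosquat_domain.strip()
    legitimate = legitimate_domain.strip()
    if not candidate or not legitimate:
        raise ValueError("both domains must be non-empty")
    return f"{PREFIX} {candidate}{sep}{legitimate}"


if __name__ == "__main__":
    # The candidate comes first, the legitimate domain second, as in the README.
    print(build_t5_prompt("anvi1ogic.com", "anvilogic.com"))
```

The order of the arguments mirrors the README's wording, where *TYPOSQUAT_DOMAIN* is appended before *LEGITIMATE_DOMAIN*; the resulting string would then be passed to whatever tokenizer and model the T5 model card specifies.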