CyBERTuned
CyBERTuned is a BERT-like model trained with a pretraining method that is aware of non-linguistic elements (NLEs) and tuned for the cybersecurity domain.
Sample Usage
>>> from transformers import pipeline
>>> folder_dir = "CyBERTuned"
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.8489783406257629, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
{'score': 0.1364559829235077, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
{'score': 0.0022238395176827908, 'token': 1912, 'token_str': ' attacks', 'sequence': 'RagnarLocker, LockBit, and REvil are types of attacks.'},
{'score': 0.001197474543005228, 'token': 11341, 'token_str': ' infections', 'sequence': 'RagnarLocker, LockBit, and REvil are types of infections.'},
{'score': 0.0009669850114732981, 'token': 6773, 'token_str': ' files', 'sequence': 'RagnarLocker, LockBit, and REvil are types of files.'}]
>>> # text requiring URL comprehension (redirection attack), modified from https://intezer.com/blog/research/targeted-phishing-attack-against-ukrainian-government-expands-to-georgia/
>>> url_text = 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical <mask> attack.'
>>> unmasker(url_text)[0]
{'score': 0.1701660305261612, 'token': 30970, 'token_str': ' redirect', 'sequence': 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical redirect attack.'}
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape  # last hidden state: (batch size, sequence length, hidden size)
torch.Size([1, 27, 768])
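The fill-mask pipeline above returns a list of dicts with `score`, `token`, `token_str`, and `sequence` keys, and `token_str` carries a leading space. The following is a minimal sketch of post-processing that output in plain Python. The helper name `confident_tokens` and the 0.05 threshold are assumptions made for illustration; they are not part of CyBERTuned or `transformers`.

```python
# Hypothetical helper (not part of CyBERTuned or transformers): keep only
# the confident fill-mask candidates from a list of prediction dicts.
def confident_tokens(predictions, min_score=0.05):
    """Return (token, score) pairs whose score is at least min_score."""
    return [
        (p["token_str"].strip(), p["score"])  # strip the leading space
        for p in predictions
        if p["score"] >= min_score
    ]


# Predictions copied from the ransomware example above
# (scores rounded, `sequence` omitted for brevity).
predictions = [
    {"score": 0.8490, "token": 25346, "token_str": " ransomware"},
    {"score": 0.1365, "token": 16886, "token_str": " malware"},
    {"score": 0.0022, "token": 1912, "token_str": " attacks"},
    {"score": 0.0012, "token": 11341, "token_str": " infections"},
    {"score": 0.0010, "token": 6773, "token_str": " files"},
]

print(confident_tokens(predictions))
```

With the assumed 0.05 threshold, only `ransomware` and `malware` remain. In practice the list would come directly from `unmasker(...)` rather than being typed in by hand.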
Citation
If you use CyBERTuned, please cite the following paper:
Eugene Jang, Jian Cui, Dayeon Yim, Youngjin Jin, Jin-Woo Chung, Seungwon Shin, and Yongjae Lee. 2024. Ignore Me But Don’t Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 29–42, Mexico City, Mexico. Association for Computational Linguistics.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0006
- train_batch_size: 64
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 2048
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.048
- num_epochs: 200
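The totals in the list above follow from the per-device values: reading `train_batch_size` and `eval_batch_size` as per-device sizes (an interpretation the numbers are consistent with), the effective training batch is 64 × 4 × 8 = 2048 and the evaluation batch is 32 × 4 = 128. The sketch below reproduces that arithmetic and illustrates one common form of a linear schedule with warmup. The function `linear_schedule_lr`, the rounding of the warmup step count, and the total of 10,000 optimizer steps are assumptions for illustration only; the actual number of training steps is not stated here.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 64             # read as per-device (assumption)
eval_batch_size = 32              # read as per-device (assumption)
num_devices = 4
gradient_accumulation_steps = 8
learning_rate = 0.0006
warmup_ratio = 0.048

# Effective batch sizes.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)


def linear_schedule_lr(step, total_steps, peak_lr=learning_rate, ratio=warmup_ratio):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0.

    Hypothetical helper; the warmup step count is rounded to the nearest integer.
    """
    warmup_steps = round(total_steps * ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative only: 10_000 total optimizer steps is an assumed figure.
for step in (0, 240, 480, 5_000, 10_000):
    print(step, linear_schedule_lr(step, total_steps=10_000))
```

Under the assumed 10,000 steps, warmup would cover the first 480 steps, after which the rate decays linearly from 0.0006 to zero.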
Framework versions
- Transformers 4.27.0.dev0
- PyTorch 1.12.1
- Datasets 2.6.1
- Tokenizers 0.13.2