Add BERTopic model
- README.md +205 -0
- config.json +15 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,205 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# BERTopic-booksum-ngram1-sentence-t5-xl-chapter

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic

# Load the saved topic model from the Hugging Face Hub.
topic_model = BERTopic.load("pszemraj/BERTopic-booksum-ngram1-sentence-t5-xl-chapter")

# Returns a DataFrame with one row per topic.
topic_model.get_topic_info()
```
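
A loaded model can also assign topics to unseen text. The following is a minimal sketch, assuming BERTopic's `transform` method and assuming the embedding model named in `config.json` (`sentence-transformers/sentence-t5-xl`) can be fetched when the model is loaded; the helper name `assign_topics` is hypothetical and not part of BERTopic.

```python
def assign_topics(docs, repo_id="pszemraj/BERTopic-booksum-ngram1-sentence-t5-xl-chapter"):
    """Return (document, topic_id) pairs for new documents (hypothetical helper)."""
    # Imported lazily so the function can be defined without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load(repo_id)
    # transform() returns predicted topic IDs and, where available, probabilities.
    topics, _probabilities = topic_model.transform(docs)
    return list(zip(docs, topics))
```

Calling `assign_topics(["Holmes and Watson crossed the moor."])` would download the model on first use and return a list of `(document, topic_id)` pairs; the IDs correspond to the `Topic ID` column in the overview table.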

## Topic overview

* Number of topics: 138
* Number of training documents: 70840

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | were - her - was - had - she | 30 | -1_were_her_was_had |
| 0 | were - had - was - could - miss | 28715 | 0_were_had_was_could |
| 1 | artagnan - athos - musketeers - porthos - treville | 16916 | 1_artagnan_athos_musketeers_porthos |
| 2 | rama - ravan - brahma - lakshman - raghu | 4563 | 2_rama_ravan_brahma_lakshman |
| 3 | were - canoe - hist - huron - hutter | 1268 | 3_were_canoe_hist_huron |
| 4 | slave - were - slavery - had - was | 1011 | 4_slave_were_slavery_had |
| 5 | holmes - sherlock - watson - moor - baskerville | 580 | 5_holmes_sherlock_watson_moor |
| 6 | prisoner - milady - felton - were - madame | 549 | 6_prisoner_milady_felton_were |
| 7 | coriolanus - cassius - brutus - sicinius - titus | 527 | 7_coriolanus_cassius_brutus_sicinius |
| 8 | confederation - constitution - federal - states - senate | 511 | 8_confederation_constitution_federal_states |
| 9 | heathcliff - catherine - wuthering - cathy - hindley | 498 | 9_heathcliff_catherine_wuthering_cathy |
| 10 | were - seemed - rima - was - had | 492 | 10_were_seemed_rima_was |
| 11 | laws - lawes - law - civill - actions | 452 | 11_laws_lawes_law_civill |
| 12 | fang - wolf - fangs - musher - growl | 401 | 12_fang_wolf_fangs_musher |
| 13 | sigurd - thorgeir - thord - gunnar - skarphedinn | 395 | 13_sigurd_thorgeir_thord_gunnar |
| 14 | achilles - troy - patroclus - aeneas - ulysses | 385 | 14_achilles_troy_patroclus_aeneas |
| 15 | fogg - passengers - passed - phileas - travellers | 376 | 15_fogg_passengers_passed_phileas |
| 16 | troy - trojans - aeneas - fates - trojan | 370 | 16_troy_trojans_aeneas_fates |
| 17 | disciples - jesus - pharisees - temple - jerusalem | 340 | 17_disciples_jesus_pharisees_temple |
| 18 | helsing - harker - diary - dr - he | 324 | 18_helsing_harker_diary_dr |
| 19 | lama - who - no - kim - am | 312 | 19_lama_who_no_kim |
| 20 | sara - princess - herself - she - minchin | 301 | 20_sara_princess_herself_she |
| 21 | horses - horse - saddle - stable - were | 293 | 21_horses_horse_saddle_stable |
| 22 | hester - pearl - scarlet - her - human | 292 | 22_hester_pearl_scarlet_her |
| 23 | candide - inquisitor - friar - cunegonde - philosopher | 286 | 23_candide_inquisitor_friar_cunegonde |
| 24 | dick - aunt - were - could - had | 275 | 24_dick_aunt_were_could |
| 25 | wolves - wolf - cub - hunger - were | 261 | 25_wolves_wolf_cub_hunger |
| 26 | god - gods - consequences - satan - som | 241 | 26_god_gods_consequences_satan |
| 27 | modesty - women - behaviour - human - woman | 240 | 27_modesty_women_behaviour_human |
| 28 | society - education - distribution - service - labour | 240 | 28_society_education_distribution_service |
| 29 | siddhartha - buddha - gotama - kamaswami - om | 237 | 29_siddhartha_buddha_gotama_kamaswami |
| 30 | ship - captain - aboard - squire - ll | 229 | 30_ship_captain_aboard_squire |
| 31 | cyrano - roxane - montfleury - hark - love | 227 | 31_cyrano_roxane_montfleury_hark |
| 32 | alice - were - rabbit - hare - hatter | 225 | 32_alice_were_rabbit_hare |
| 33 | toto - kansas - dorothy - oz - scarecrow | 211 | 33_toto_kansas_dorothy_oz |
| 34 | lancelot - camelot - merlin - guinevere - arthur | 209 | 34_lancelot_camelot_merlin_guinevere |
| 35 | were - soldiers - seemed - soldier - th | 201 | 35_were_soldiers_seemed_soldier |
| 36 | were - was - fields - seemed - hills | 200 | 36_were_was_fields_seemed |
| 37 | reason - thyself - actions - thine - life | 179 | 37_reason_thyself_actions_thine |
| 38 | hetty - her - she - judith - were | 170 | 38_hetty_her_she_judith |
| 39 | othello - iago - desdemona - ll - roderigo | 170 | 39_othello_iago_desdemona_ll |
| 40 | wildeve - yes - were - vye - was | 165 | 40_wildeve_yes_were_vye |
| 41 | utilitarian - morality - morals - virtue - moral | 165 | 41_utilitarian_morality_morals_virtue |
| 42 | ransom - isaac - thine - thy - shekels | 163 | 42_ransom_isaac_thine_thy |
| 43 | weasels - rat - ratty - toad - badger | 157 | 43_weasels_rat_ratty_toad |
| 44 | philip - he - were - vicar - was | 155 | 44_philip_he_were_vicar |
| 45 | macbeth - banquo - macduff - fleance - murderer | 154 | 45_macbeth_banquo_macduff_fleance |
| 46 | lydgate - bulstrode - himself - he - had | 145 | 46_lydgate_bulstrode_himself_he |
| 47 | capulet - romeo - juliet - verona - mercutio | 142 | 47_capulet_romeo_juliet_verona |
| 48 | dying - her - were - helen - she | 141 | 48_dying_her_were_helen |
| 49 | anne - avonlea - diana - her - marilla | 141 | 49_anne_avonlea_diana_her |
| 50 | tartuffe - scene - dorine - pernelle - scoundrel | 140 | 50_tartuffe_scene_dorine_pernelle |
| 51 | were - yes - had - was - no | 139 | 51_were_yes_had_was |
| 52 | jekyll - hyde - were - myself - had | 135 | 52_jekyll_hyde_were_myself |
| 53 | loved - were - philip - was - could | 128 | 53_loved_were_philip_was |
| 54 | falstaff - mistress - ford - forsooth - windsor | 127 | 54_falstaff_mistress_ford_forsooth |
| 55 | hurstwood - were - barn - had - was | 127 | 55_hurstwood_were_barn_had |
| 56 | provost - capell - collier - conj - pope | 126 | 56_provost_capell_collier_conj |
| 57 | gretchen - highness - chancellor - hildegarde - yes | 125 | 57_gretchen_highness_chancellor_hildegarde |
| 58 | delamere - watson - dr - ll - no | 124 | 58_delamere_watson_dr_ll |
| 59 | jem - her - were - felt - margaret | 123 | 59_jem_her_were_felt |
| 60 | beowulf - grendel - hrothgar - wiglaf - hero | 111 | 60_beowulf_grendel_hrothgar_wiglaf |
| 61 | verloc - seemed - was - were - had | 102 | 61_verloc_seemed_was_were |
| 62 | hamlet - guildenstern - rosencrantz - fortinbras - polonius | 102 | 62_hamlet_guildenstern_rosencrantz_fortinbras |
| 63 | corey - mrs - yes - business - lapham | 101 | 63_corey_mrs_yes_business |
| 64 | projectiles - cannon - projectile - distance - satellite | 99 | 64_projectiles_cannon_projectile_distance |
| 65 | piano - musical - music - played - beethoven | 98 | 65_piano_musical_music_played |
| 66 | wedding - bridegroom - were - marriage - looked | 93 | 66_wedding_bridegroom_were_marriage |
| 67 | juan - her - fame - some - had | 92 | 67_juan_her_fame_some |
| 68 | were - looked - felt - her - had | 91 | 68_were_looked_felt_her |
| 69 | staked - gambling - wildeve - stakes - dice | 91 | 69_staked_gambling_wildeve_stakes |
| 70 | mistress - leonora - wanted - florence - was | 89 | 70_mistress_leonora_wanted_florence |
| 71 | delano - ship - sailor - captain - benito | 87 | 71_delano_ship_sailor_captain |
| 72 | yes - goring - no - robert - room | 85 | 72_yes_goring_no_robert |
| 73 | stockmann - yes - horster - mayor - dr | 81 | 73_stockmann_yes_horster_mayor |
| 74 | ll - were - looked - carl - was | 80 | 74_ll_were_looked_carl |
| 75 | barber - philosophy - no - some - man | 78 | 75_barber_philosophy_no_some |
| 76 | tom - maggie - came - had - tulliver | 78 | 76_tom_maggie_came_had |
| 77 | middlemarch - hustings - candidate - brooke - may | 75 | 77_middlemarch_hustings_candidate_brooke |
| 78 | inspector - verloc - yes - affair - police | 75 | 78_inspector_verloc_yes_affair |
| 79 | scrooge - merry - no - christmas - man | 73 | 79_scrooge_merry_no_christmas |
| 80 | coquenard - mutton - served - were - pudding | 70 | 80_coquenard_mutton_served_were |
| 81 | yes - no - jack - ll - tell | 69 | 81_yes_no_jack_ll |
| 82 | seth - lisbeth - th - ud - no | 67 | 82_seth_lisbeth_th_ud |
| 83 | higgins - eliza - her - she - liza | 66 | 83_higgins_eliza_her_she |
| 84 | yarmouth - were - went - had - was | 65 | 84_yarmouth_were_went_had |
| 85 | servian - sergius - yes - catherine - no | 64 | 85_servian_sergius_yes_catherine |
| 86 | service - army - salvation - institution - training | 61 | 86_service_army_salvation_institution |
| 87 | condemn - ff - pray - mercy - conj | 58 | 87_condemn_ff_pray_mercy |
| 88 | lucy - bartlett - were - could - she | 57 | 88_lucy_bartlett_were_could |
| 89 | wills - seemed - bequest - were - testator | 54 | 89_wills_seemed_bequest_were |
| 90 | scene - iii - malvolio - valentine - cesario | 54 | 90_scene_iii_malvolio_valentine |
| 91 | fuss - think - ll - thinks - oh | 53 | 91_fuss_think_ll_thinks |
| 92 | hermia - demetrius - helena - theseus - helen | 50 | 92_hermia_demetrius_helena_theseus |
| 93 | seemed - rochester - were - had - yes | 50 | 93_seemed_rochester_were_had |
| 94 | sorrow - mourned - myself - had - was | 48 | 94_sorrow_mourned_myself_had |
| 95 | gerty - sleepless - tea - weariness - tired | 48 | 95_gerty_sleepless_tea_weariness |
| 96 | rushworth - crawford - were - sotherton - was | 47 | 96_rushworth_crawford_were_sotherton |
| 97 | reasoning - syllogisme - names - signification - definitions | 46 | 97_reasoning_syllogisme_names_signification |
| 98 | could - caleb - sure - work - no | 46 | 98_could_caleb_sure_work |
| 99 | rose - tears - hope - tell - wish | 46 | 99_rose_tears_hope_tell |
| 100 | peggotty - em - gummidge - he - ll | 46 | 100_peggotty_em_gummidge_he |
| 101 | time - future - story - paradox - traveller | 46 | 101_time_future_story_paradox |
| 102 | cleopatra - antony - caesar - loved - slave | 45 | 102_cleopatra_antony_caesar_loved |
| 103 | appendicitis - doctors - doctor - dr - wanted | 45 | 103_appendicitis_doctors_doctor_dr |
| 104 | slept - awoke - waking - sleep - seemed | 44 | 104_slept_awoke_waking_sleep |
| 105 | parlour - room - seemed - sat - had | 43 | 105_parlour_room_seemed_sat |
| 106 | prophets - scripture - prophet - moses - prophecy | 43 | 106_prophets_scripture_prophet_moses |
| 107 | letter - honour - adieu - duval - evelina | 43 | 107_letter_honour_adieu_duval |
| 108 | complications - cranky - had - tanis - was | 43 | 108_complications_cranky_had_tanis |
| 109 | fled - armies - brussels - imperial - napoleon | 42 | 109_fled_armies_brussels_imperial |
| 110 | philip - easel - greco - impressionists - manet | 42 | 110_philip_easel_greco_impressionists |
| 111 | harlings - harling - frances - were - shimerdas | 40 | 111_harlings_harling_frances_were |
| 112 | jane - mrs - janet - eyre - her | 40 | 112_jane_mrs_janet_eyre |
| 113 | prisoner - confinement - prisoners - prison - gaoler | 40 | 113_prisoner_confinement_prisoners_prison |
| 114 | hardcastle - marlow - impudence - constance - modesty | 40 | 114_hardcastle_marlow_impudence_constance |
| 115 | horatio - murder - revenge - sorrow - hieronimo | 40 | 115_horatio_murder_revenge_sorrow |
| 116 | traddles - had - married - room - horace | 39 | 116_traddles_had_married_room |
| 117 | philip - tell - feelings - was - remember | 38 | 117_philip_tell_feelings_was |
| 118 | nervous - countenance - seemed - he - huxtable | 38 | 118_nervous_countenance_seemed_he |
| 119 | rogers - wanted - lapham - could - silas | 38 | 119_rogers_wanted_lapham_could |
| 120 | titus - timon - varro - servilius - alcibiades | 37 | 120_titus_timon_varro_servilius |
| 121 | morality - justice - moral - impartiality - unjust | 37 | 121_morality_justice_moral_impartiality |
| 122 | willard - elmer - were - was - henderson | 37 | 122_willard_elmer_were_was |
| 123 | had - was - could - circumstances - possession | 37 | 123_had_was_could_circumstances |
| 124 | monkey - he - sahib - rat - sara | 36 | 124_monkey_he_sahib_rat |
| 125 | mcmurdo - mcginty - cormac - police - scanlan | 36 | 125_mcmurdo_mcginty_cormac_police |
| 126 | hetty - herself - she - her - had | 36 | 126_hetty_herself_she_her |
| 127 | dimmesdale - reverend - chillingworth - clergyman - deacon | 35 | 127_dimmesdale_reverend_chillingworth_clergyman |
| 128 | formerly - eliza - was - friend - friends | 34 | 128_formerly_eliza_was_friend |
| 129 | were - seemed - had - was - felt | 34 | 129_were_seemed_had_was |
| 130 | prisoner - jerry - lorry - tellson - court | 33 | 130_prisoner_jerry_lorry_tellson |
| 131 | macmurdo - wenham - captain - steyne - crawley | 33 | 131_macmurdo_wenham_captain_steyne |
| 132 | ducal - duchy - xv - fetes - theatre | 32 | 132_ducal_duchy_xv_fetes |
| 133 | chapter - book - dows - unt - windowpane | 32 | 133_chapter_book_dows_unt |
| 134 | money - riches - things - risk - thoughts | 31 | 134_money_riches_things_risk |
| 135 | bethy - beth - seemed - sister - her | 31 | 135_bethy_beth_seemed_sister |
| 136 | oliver - pickwick - were - was - inn | 30 | 136_oliver_pickwick_were_was |

</details>
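
Each `Label` in the table appears to be the topic ID followed by the first four keywords, joined with underscores, and in BERTopic topic `-1` collects outlier documents. As a rough sketch, the labels can be split back into an ID and a keyword list. The function name `parse_label` is hypothetical, and the split assumes no keyword itself contains an underscore.

```python
OUTLIER_TOPIC = -1  # BERTopic reserves -1 for documents not assigned to a topic


def parse_label(label):
    """Split a label such as '5_holmes_sherlock_watson_moor' into (5, [keywords])."""
    # Assumption: keywords contain no underscores, so a plain split is enough.
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords


labels = ["-1_were_her_was_had", "5_holmes_sherlock_watson_moor"]
parsed = dict(parse_label(label) for label in labels)

# Drop the outlier topic before reporting named topics.
named = {tid: kws for tid, kws in parsed.items() if tid != OUTLIER_TOPIC}
print(named)  # {5: ['holmes', 'sherlock', 'watson', 'moor']}
```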

## Training hyperparameters

* calculate_probabilities: True
* language: None
* low_memory: False
* min_topic_size: 30
* n_gram_range: (1, 1)
* nr_topics: auto
* seed_topic_list: None
* top_n_words: 10
* verbose: True

## Framework versions

* Numpy: 1.24.3
* HDBSCAN: 0.8.29
* UMAP: 0.5.3
* Pandas: 2.0.2
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.30.2
* Numba: 0.57.1
* Plotly: 5.15.0
* Python: 3.10.11
config.json
ADDED
@@ -0,0 +1,15 @@
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 30,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": "auto",
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "embedding_model": "sentence-transformers/sentence-t5-xl"
}
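
The JSON above stores `n_gram_range` as a list, while the README reports it as the tuple `(1, 1)`. A small sketch of reading the config back into Python keyword arguments follows; the function name `config_to_kwargs` is hypothetical, and the config text is inlined here only so the example is self-contained.

```python
import json

CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 30,
  "n_gram_range": [1, 1],
  "nr_topics": "auto",
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "embedding_model": "sentence-transformers/sentence-t5-xl"
}
"""


def config_to_kwargs(config_text):
    """Parse the config and restore the tuple form of n_gram_range."""
    params = json.loads(config_text)
    # JSON has no tuple type, so the range round-trips as a list.
    params["n_gram_range"] = tuple(params["n_gram_range"])
    return params


kwargs = config_to_kwargs(CONFIG_TEXT)
print(kwargs["n_gram_range"], kwargs["embedding_model"])
```

The keys mirror BERTopic constructor arguments, so `BERTopic(**kwargs)` should build an untrained model with the same settings. That call is left out here because it needs the library installed and would not reproduce the fitted topics.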
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fe48aeaf967ef9cb31d88fbeb6f679c7e871ff629eabd7d2fddaec3ad14e0c51
size 9194548
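
These three lines are a Git LFS pointer: the repository stores only the object's SHA-256 digest and byte size, and the real tensor file lives in LFS storage. A minimal sketch of checking a downloaded file against such a pointer is shown below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the sketch reads the whole file into memory for simplicity.

```python
import hashlib

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:fe48aeaf967ef9cb31d88fbeb6f679c7e871ff629eabd7d2fddaec3ad14e0c51
size 9194548
"""


def parse_lfs_pointer(text):
    """Turn the 'key value' lines of an LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check that raw bytes have the size and digest recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

With the real file on disk, `matches_pointer(open("ctfidf.safetensors", "rb").read(), pointer)` would report whether the download is intact.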
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce312318dd491badda2233c28b27ec1f4df61b5b8e579df6a310e7513bca7c01
size 424024
topics.json
ADDED
The diff for this file is too large to render.