--- pipeline_tag: sentence-similarity tags: - sentence-transformers - feature-extraction - sentence-similarity - transformers - semantic-search - chinese - mteb model-index: - name: sbert-chinese-general-v1 results: - task: type: STS dataset: type: C-MTEB/AFQMC name: MTEB AFQMC config: default split: validation revision: None metrics: - type: cos_sim_pearson value: 22.293919432958074 - type: cos_sim_spearman value: 22.56718923553609 - type: euclidean_pearson value: 22.525656322797026 - type: euclidean_spearman value: 22.56718923553609 - type: manhattan_pearson value: 22.501773028824065 - type: manhattan_spearman value: 22.536992587828397 - task: type: STS dataset: type: C-MTEB/ATEC name: MTEB ATEC config: default split: test revision: None metrics: - type: cos_sim_pearson value: 30.33575274463879 - type: cos_sim_spearman value: 30.298708742167772 - type: euclidean_pearson value: 32.33094743729218 - type: euclidean_spearman value: 30.298710993858734 - type: manhattan_pearson value: 32.31155376195945 - type: manhattan_spearman value: 30.267669681690744 - task: type: Classification dataset: type: mteb/amazon_reviews_multi name: MTEB AmazonReviewsClassification (zh) config: zh split: test revision: 1399c76144fd37290681b995c656ef9b2e06e26d metrics: - type: accuracy value: 37.507999999999996 - type: f1 value: 36.436808400753286 - task: type: STS dataset: type: C-MTEB/BQ name: MTEB BQ config: default split: test revision: None metrics: - type: cos_sim_pearson value: 41.493256724214255 - type: cos_sim_spearman value: 40.98395961967895 - type: euclidean_pearson value: 41.12345737966565 - type: euclidean_spearman value: 40.983959619555996 - type: manhattan_pearson value: 41.02584539471014 - type: manhattan_spearman value: 40.87549513383032 - task: type: BitextMining dataset: type: mteb/bucc-bitext-mining name: MTEB BUCC (zh-en) config: zh-en split: test revision: d51519689f32196a32af33b075a01d0e7c51e252 metrics: - type: accuracy value: 9.794628751974724 - type: f1 value: 
9.350535369492716 - type: precision value: 9.179392662804986 - type: recall value: 9.794628751974724 - task: type: Clustering dataset: type: C-MTEB/CLSClusteringP2P name: MTEB CLSClusteringP2P config: default split: test revision: None metrics: - type: v_measure value: 34.984726547788284 - task: type: Clustering dataset: type: C-MTEB/CLSClusteringS2S name: MTEB CLSClusteringS2S config: default split: test revision: None metrics: - type: v_measure value: 27.81945732281589 - task: type: Reranking dataset: type: C-MTEB/CMedQAv1-reranking name: MTEB CMedQAv1 config: default split: test revision: None metrics: - type: map value: 53.06586280826805 - type: mrr value: 59.58781746031746 - task: type: Reranking dataset: type: C-MTEB/CMedQAv2-reranking name: MTEB CMedQAv2 config: default split: test revision: None metrics: - type: map value: 52.83635946154306 - type: mrr value: 59.315079365079356 - task: type: Retrieval dataset: type: C-MTEB/CmedqaRetrieval name: MTEB CmedqaRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 5.721 - type: map_at_10 value: 8.645 - type: map_at_100 value: 9.434 - type: map_at_1000 value: 9.586 - type: map_at_3 value: 7.413 - type: map_at_5 value: 8.05 - type: mrr_at_1 value: 9.626999999999999 - type: mrr_at_10 value: 13.094 - type: mrr_at_100 value: 13.854 - type: mrr_at_1000 value: 13.958 - type: mrr_at_3 value: 11.724 - type: mrr_at_5 value: 12.409 - type: ndcg_at_1 value: 9.626999999999999 - type: ndcg_at_10 value: 11.35 - type: ndcg_at_100 value: 15.593000000000002 - type: ndcg_at_1000 value: 19.619 - type: ndcg_at_3 value: 9.317 - type: ndcg_at_5 value: 10.049 - type: precision_at_1 value: 9.626999999999999 - type: precision_at_10 value: 2.796 - type: precision_at_100 value: 0.629 - type: precision_at_1000 value: 0.11800000000000001 - type: precision_at_3 value: 5.476 - type: precision_at_5 value: 4.1209999999999996 - type: recall_at_1 value: 5.721 - type: recall_at_10 value: 15.190000000000001 - type: 
recall_at_100 value: 33.633 - type: recall_at_1000 value: 62.019999999999996 - type: recall_at_3 value: 9.099 - type: recall_at_5 value: 11.423 - task: type: PairClassification dataset: type: C-MTEB/CMNLI name: MTEB Cmnli config: default split: validation revision: None metrics: - type: cos_sim_accuracy value: 77.36620565243535 - type: cos_sim_ap value: 85.92291866877001 - type: cos_sim_f1 value: 78.19390231037029 - type: cos_sim_precision value: 71.24183006535948 - type: cos_sim_recall value: 86.64952069207388 - type: dot_accuracy value: 77.36620565243535 - type: dot_ap value: 85.94113738490068 - type: dot_f1 value: 78.19390231037029 - type: dot_precision value: 71.24183006535948 - type: dot_recall value: 86.64952069207388 - type: euclidean_accuracy value: 77.36620565243535 - type: euclidean_ap value: 85.92291893444687 - type: euclidean_f1 value: 78.19390231037029 - type: euclidean_precision value: 71.24183006535948 - type: euclidean_recall value: 86.64952069207388 - type: manhattan_accuracy value: 77.29404690318701 - type: manhattan_ap value: 85.88284362100919 - type: manhattan_f1 value: 78.17836812144213 - type: manhattan_precision value: 71.18448838548666 - type: manhattan_recall value: 86.69628244096329 - type: max_accuracy value: 77.36620565243535 - type: max_ap value: 85.94113738490068 - type: max_f1 value: 78.19390231037029 - task: type: Retrieval dataset: type: C-MTEB/CovidRetrieval name: MTEB CovidRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 26.976 - type: map_at_10 value: 35.18 - type: map_at_100 value: 35.921 - type: map_at_1000 value: 35.998999999999995 - type: map_at_3 value: 32.763 - type: map_at_5 value: 34.165 - type: mrr_at_1 value: 26.976 - type: mrr_at_10 value: 35.234 - type: mrr_at_100 value: 35.939 - type: mrr_at_1000 value: 36.016 - type: mrr_at_3 value: 32.771 - type: mrr_at_5 value: 34.172999999999995 - type: ndcg_at_1 value: 26.976 - type: ndcg_at_10 value: 39.635 - type: ndcg_at_100 value: 43.54 - 
type: ndcg_at_1000 value: 45.723 - type: ndcg_at_3 value: 34.652 - type: ndcg_at_5 value: 37.186 - type: precision_at_1 value: 26.976 - type: precision_at_10 value: 5.406 - type: precision_at_100 value: 0.736 - type: precision_at_1000 value: 0.091 - type: precision_at_3 value: 13.418 - type: precision_at_5 value: 9.293999999999999 - type: recall_at_1 value: 26.976 - type: recall_at_10 value: 53.766999999999996 - type: recall_at_100 value: 72.761 - type: recall_at_1000 value: 90.148 - type: recall_at_3 value: 40.095 - type: recall_at_5 value: 46.233000000000004 - task: type: Retrieval dataset: type: C-MTEB/DuRetrieval name: MTEB DuRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 11.285 - type: map_at_10 value: 30.259000000000004 - type: map_at_100 value: 33.772000000000006 - type: map_at_1000 value: 34.037 - type: map_at_3 value: 21.038999999999998 - type: map_at_5 value: 25.939 - type: mrr_at_1 value: 45.1 - type: mrr_at_10 value: 55.803999999999995 - type: mrr_at_100 value: 56.301 - type: mrr_at_1000 value: 56.330999999999996 - type: mrr_at_3 value: 53.333 - type: mrr_at_5 value: 54.798 - type: ndcg_at_1 value: 45.1 - type: ndcg_at_10 value: 41.156 - type: ndcg_at_100 value: 49.518 - type: ndcg_at_1000 value: 52.947 - type: ndcg_at_3 value: 39.708 - type: ndcg_at_5 value: 38.704 - type: precision_at_1 value: 45.1 - type: precision_at_10 value: 20.75 - type: precision_at_100 value: 3.424 - type: precision_at_1000 value: 0.42700000000000005 - type: precision_at_3 value: 35.632999999999996 - type: precision_at_5 value: 30.080000000000002 - type: recall_at_1 value: 11.285 - type: recall_at_10 value: 43.242000000000004 - type: recall_at_100 value: 68.604 - type: recall_at_1000 value: 85.904 - type: recall_at_3 value: 24.404 - type: recall_at_5 value: 32.757 - task: type: Retrieval dataset: type: C-MTEB/EcomRetrieval name: MTEB EcomRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 21 - type: 
map_at_10 value: 28.364 - type: map_at_100 value: 29.199 - type: map_at_1000 value: 29.265 - type: map_at_3 value: 25.717000000000002 - type: map_at_5 value: 27.311999999999998 - type: mrr_at_1 value: 21 - type: mrr_at_10 value: 28.364 - type: mrr_at_100 value: 29.199 - type: mrr_at_1000 value: 29.265 - type: mrr_at_3 value: 25.717000000000002 - type: mrr_at_5 value: 27.311999999999998 - type: ndcg_at_1 value: 21 - type: ndcg_at_10 value: 32.708 - type: ndcg_at_100 value: 37.184 - type: ndcg_at_1000 value: 39.273 - type: ndcg_at_3 value: 27.372000000000003 - type: ndcg_at_5 value: 30.23 - type: precision_at_1 value: 21 - type: precision_at_10 value: 4.66 - type: precision_at_100 value: 0.685 - type: precision_at_1000 value: 0.086 - type: precision_at_3 value: 10.732999999999999 - type: precision_at_5 value: 7.82 - type: recall_at_1 value: 21 - type: recall_at_10 value: 46.6 - type: recall_at_100 value: 68.5 - type: recall_at_1000 value: 85.6 - type: recall_at_3 value: 32.2 - type: recall_at_5 value: 39.1 - task: type: Classification dataset: type: C-MTEB/IFlyTek-classification name: MTEB IFlyTek config: default split: validation revision: None metrics: - type: accuracy value: 44.878799538283964 - type: f1 value: 33.84678310261366 - task: type: Classification dataset: type: C-MTEB/JDReview-classification name: MTEB JDReview config: default split: test revision: None metrics: - type: accuracy value: 82.1951219512195 - type: ap value: 46.78292030042397 - type: f1 value: 76.20482468514128 - task: type: STS dataset: type: C-MTEB/LCQMC name: MTEB LCQMC config: default split: test revision: None metrics: - type: cos_sim_pearson value: 62.84331627244547 - type: cos_sim_spearman value: 68.39990265073726 - type: euclidean_pearson value: 66.87431827169324 - type: euclidean_spearman value: 68.39990264979167 - type: manhattan_pearson value: 66.89702078900328 - type: manhattan_spearman value: 68.42107302159141 - task: type: Reranking dataset: type: C-MTEB/Mmarco-reranking name: 
MTEB MMarcoReranking config: default split: dev revision: None metrics: - type: map value: 9.28600891904827 - type: mrr value: 8.057936507936509 - task: type: Retrieval dataset: type: C-MTEB/MMarcoRetrieval name: MTEB MMarcoRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 22.820999999999998 - type: map_at_10 value: 30.44 - type: map_at_100 value: 31.35 - type: map_at_1000 value: 31.419000000000004 - type: map_at_3 value: 28.134999999999998 - type: map_at_5 value: 29.482000000000003 - type: mrr_at_1 value: 23.782 - type: mrr_at_10 value: 31.141999999999996 - type: mrr_at_100 value: 32.004 - type: mrr_at_1000 value: 32.068000000000005 - type: mrr_at_3 value: 28.904000000000003 - type: mrr_at_5 value: 30.214999999999996 - type: ndcg_at_1 value: 23.782 - type: ndcg_at_10 value: 34.625 - type: ndcg_at_100 value: 39.226 - type: ndcg_at_1000 value: 41.128 - type: ndcg_at_3 value: 29.968 - type: ndcg_at_5 value: 32.35 - type: precision_at_1 value: 23.782 - type: precision_at_10 value: 4.994 - type: precision_at_100 value: 0.736 - type: precision_at_1000 value: 0.09 - type: precision_at_3 value: 12.13 - type: precision_at_5 value: 8.495999999999999 - type: recall_at_1 value: 22.820999999999998 - type: recall_at_10 value: 47.141 - type: recall_at_100 value: 68.952 - type: recall_at_1000 value: 83.985 - type: recall_at_3 value: 34.508 - type: recall_at_5 value: 40.232 - task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (zh-CN) config: zh-CN split: test revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 metrics: - type: accuracy value: 57.343644922663074 - type: f1 value: 56.744802953803486 - task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (zh-CN) config: zh-CN split: test revision: 7d571f92784cd94a019292a1f45445077d0ef634 metrics: - type: accuracy value: 62.363819771351714 - type: f1 value: 62.15920863434656 - task: 
type: Retrieval dataset: type: C-MTEB/MedicalRetrieval name: MTEB MedicalRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 14.6 - type: map_at_10 value: 18.231 - type: map_at_100 value: 18.744 - type: map_at_1000 value: 18.811 - type: map_at_3 value: 17.133000000000003 - type: map_at_5 value: 17.663 - type: mrr_at_1 value: 14.6 - type: mrr_at_10 value: 18.231 - type: mrr_at_100 value: 18.744 - type: mrr_at_1000 value: 18.811 - type: mrr_at_3 value: 17.133000000000003 - type: mrr_at_5 value: 17.663 - type: ndcg_at_1 value: 14.6 - type: ndcg_at_10 value: 20.349 - type: ndcg_at_100 value: 23.204 - type: ndcg_at_1000 value: 25.44 - type: ndcg_at_3 value: 17.995 - type: ndcg_at_5 value: 18.945999999999998 - type: precision_at_1 value: 14.6 - type: precision_at_10 value: 2.7199999999999998 - type: precision_at_100 value: 0.414 - type: precision_at_1000 value: 0.06 - type: precision_at_3 value: 6.833 - type: precision_at_5 value: 4.5600000000000005 - type: recall_at_1 value: 14.6 - type: recall_at_10 value: 27.200000000000003 - type: recall_at_100 value: 41.4 - type: recall_at_1000 value: 60 - type: recall_at_3 value: 20.5 - type: recall_at_5 value: 22.8 - task: type: Classification dataset: type: C-MTEB/MultilingualSentiment-classification name: MTEB MultilingualSentiment config: default split: validation revision: None metrics: - type: accuracy value: 66.58333333333333 - type: f1 value: 66.26700927460007 - task: type: PairClassification dataset: type: C-MTEB/OCNLI name: MTEB Ocnli config: default split: validation revision: None metrics: - type: cos_sim_accuracy value: 72.00866269626421 - type: cos_sim_ap value: 77.00520104243304 - type: cos_sim_f1 value: 74.39303710490151 - type: cos_sim_precision value: 65.69579288025889 - type: cos_sim_recall value: 85.74445617740233 - type: dot_accuracy value: 72.00866269626421 - type: dot_ap value: 77.00520104243304 - type: dot_f1 value: 74.39303710490151 - type: dot_precision value: 
65.69579288025889 - type: dot_recall value: 85.74445617740233 - type: euclidean_accuracy value: 72.00866269626421 - type: euclidean_ap value: 77.00520104243304 - type: euclidean_f1 value: 74.39303710490151 - type: euclidean_precision value: 65.69579288025889 - type: euclidean_recall value: 85.74445617740233 - type: manhattan_accuracy value: 72.1710882512182 - type: manhattan_ap value: 77.00551017913976 - type: manhattan_f1 value: 74.23423423423424 - type: manhattan_precision value: 64.72898664571878 - type: manhattan_recall value: 87.0116156282999 - type: max_accuracy value: 72.1710882512182 - type: max_ap value: 77.00551017913976 - type: max_f1 value: 74.39303710490151 - task: type: Classification dataset: type: C-MTEB/OnlineShopping-classification name: MTEB OnlineShopping config: default split: test revision: None metrics: - type: accuracy value: 88.19000000000001 - type: ap value: 85.13415594781077 - type: f1 value: 88.17344156114062 - task: type: STS dataset: type: C-MTEB/PAWSX name: MTEB PAWSX config: default split: test revision: None metrics: - type: cos_sim_pearson value: 13.70522140998517 - type: cos_sim_spearman value: 15.07546667334743 - type: euclidean_pearson value: 17.49511420225285 - type: euclidean_spearman value: 15.093970931789618 - type: manhattan_pearson value: 17.44069961390521 - type: manhattan_spearman value: 15.076029291596962 - task: type: STS dataset: type: C-MTEB/QBQTC name: MTEB QBQTC config: default split: test revision: None metrics: - type: cos_sim_pearson value: 26.835294224547155 - type: cos_sim_spearman value: 27.920204597498856 - type: euclidean_pearson value: 26.153796707702803 - type: euclidean_spearman value: 27.920971379720548 - type: manhattan_pearson value: 26.21954147857523 - type: manhattan_spearman value: 27.996860049937478 - task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (zh) config: zh split: test revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 metrics: - type: cos_sim_pearson value: 
55.15901259718581 - type: cos_sim_spearman value: 61.57967880874167 - type: euclidean_pearson value: 53.83523291596683 - type: euclidean_spearman value: 61.57967880874167 - type: manhattan_pearson value: 54.99971428907956 - type: manhattan_spearman value: 61.61229543613867 - task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (zh-en) config: zh-en split: test revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 metrics: - type: cos_sim_pearson value: 34.20930208460845 - type: cos_sim_spearman value: 33.879011104224524 - type: euclidean_pearson value: 35.08526425284862 - type: euclidean_spearman value: 33.879011104224524 - type: manhattan_pearson value: 35.509419089701275 - type: manhattan_spearman value: 33.30035487147621 - task: type: STS dataset: type: C-MTEB/STSB name: MTEB STSB config: default split: test revision: None metrics: - type: cos_sim_pearson value: 82.30068282185835 - type: cos_sim_spearman value: 82.16763221361724 - type: euclidean_pearson value: 80.52772752433374 - type: euclidean_spearman value: 82.16797037220333 - type: manhattan_pearson value: 80.51093859500105 - type: manhattan_spearman value: 82.17643310049654 - task: type: Reranking dataset: type: C-MTEB/T2Reranking name: MTEB T2Reranking config: default split: dev revision: None metrics: - type: map value: 65.14113035189213 - type: mrr value: 74.9589270937443 - task: type: Retrieval dataset: type: C-MTEB/T2Retrieval name: MTEB T2Retrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 12.013 - type: map_at_10 value: 30.885 - type: map_at_100 value: 34.643 - type: map_at_1000 value: 34.927 - type: map_at_3 value: 21.901 - type: map_at_5 value: 26.467000000000002 - type: mrr_at_1 value: 49.623 - type: mrr_at_10 value: 58.05200000000001 - type: mrr_at_100 value: 58.61300000000001 - type: mrr_at_1000 value: 58.643 - type: mrr_at_3 value: 55.947 - type: mrr_at_5 value: 57.229 - type: ndcg_at_1 value: 49.623 - type: ndcg_at_10 value: 41.802 - type: 
ndcg_at_100 value: 49.975 - type: ndcg_at_1000 value: 53.504 - type: ndcg_at_3 value: 43.515 - type: ndcg_at_5 value: 41.576 - type: precision_at_1 value: 49.623 - type: precision_at_10 value: 22.052 - type: precision_at_100 value: 3.6450000000000005 - type: precision_at_1000 value: 0.45399999999999996 - type: precision_at_3 value: 38.616 - type: precision_at_5 value: 31.966 - type: recall_at_1 value: 12.013 - type: recall_at_10 value: 41.891 - type: recall_at_100 value: 67.096 - type: recall_at_1000 value: 84.756 - type: recall_at_3 value: 24.695 - type: recall_at_5 value: 32.09 - task: type: Classification dataset: type: C-MTEB/TNews-classification name: MTEB TNews config: default split: validation revision: None metrics: - type: accuracy value: 39.800999999999995 - type: f1 value: 38.5345899934575 - task: type: Clustering dataset: type: C-MTEB/ThuNewsClusteringP2P name: MTEB ThuNewsClusteringP2P config: default split: test revision: None metrics: - type: v_measure value: 40.16574242797479 - task: type: Clustering dataset: type: C-MTEB/ThuNewsClusteringS2S name: MTEB ThuNewsClusteringS2S config: default split: test revision: None metrics: - type: v_measure value: 24.232617974671754 - task: type: Retrieval dataset: type: C-MTEB/VideoRetrieval name: MTEB VideoRetrieval config: default split: dev revision: None metrics: - type: map_at_1 value: 24.6 - type: map_at_10 value: 31.328 - type: map_at_100 value: 32.088 - type: map_at_1000 value: 32.164 - type: map_at_3 value: 29.133 - type: map_at_5 value: 30.358 - type: mrr_at_1 value: 24.6 - type: mrr_at_10 value: 31.328 - type: mrr_at_100 value: 32.088 - type: mrr_at_1000 value: 32.164 - type: mrr_at_3 value: 29.133 - type: mrr_at_5 value: 30.358 - type: ndcg_at_1 value: 24.6 - type: ndcg_at_10 value: 35.150999999999996 - type: ndcg_at_100 value: 39.024 - type: ndcg_at_1000 value: 41.157 - type: ndcg_at_3 value: 30.637999999999998 - type: ndcg_at_5 value: 32.833 - type: precision_at_1 value: 24.6 - type: precision_at_10 
value: 4.74 - type: precision_at_100 value: 0.66 - type: precision_at_1000 value: 0.083 - type: precision_at_3 value: 11.667 - type: precision_at_5 value: 8.06 - type: recall_at_1 value: 24.6 - type: recall_at_10 value: 47.4 - type: recall_at_100 value: 66 - type: recall_at_1000 value: 83 - type: recall_at_3 value: 35 - type: recall_at_5 value: 40.300000000000004 - task: type: Classification dataset: type: C-MTEB/waimai-classification name: MTEB Waimai config: default split: test revision: None metrics: - type: accuracy value: 83.96000000000001 - type: ap value: 65.11027167433211 - type: f1 value: 82.03549710974653 license: apache-2.0 language: - zh ---

# DMetaSoul/sbert-chinese-general-v1

This model is based on the [bert-base-chinese](https://huggingface.co/bert-base-chinese) BERT model and was trained on semantic-similarity datasets including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It is intended for **general-purpose semantic matching**, for example text feature extraction, text embedding clustering, and semantic text search. The model performs well on the Chinese-STS task but is not the best choice for other tasks, and it carries some risk of overfitting.

Note: a [lightweight version](https://huggingface.co/DMetaSoul/sbert-chinese-general-v1-distill) of this model has also been open-sourced.

# Usage

## 1. Sentence-Transformers

To use the model through the [sentence-transformers](https://www.SBERT.net) framework, install it first:

```
pip install -U sentence-transformers
```

Then load the model and extract sentence embeddings with the following code:

```python
from sentence_transformers import SentenceTransformer

# Two paraphrased Chinese sentences to embed
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load the model from the HuggingFace Hub and encode the sentences
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)
print(embeddings)
```

## 2. HuggingFace Transformers

If you prefer not to use [sentence-transformers](https://www.SBERT.net), you can also load the model with HuggingFace Transformers and extract sentence embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Evaluation

The model was evaluated on several public semantic-matching datasets by computing the correlation coefficient between embedding similarity and the gold labels:

|              | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** |
| ------------ | ------------ | ------------- | --------- | --------- | ------------ | --------- | ---------- |
| **spearman** | 84.54%       | 82.17%        | 23.80%    | 65.94%    | 45.52%       | 11.52%    | 48.51%     |

## Citing & Authors

E-mail: xiaowenbin@dmetasoul.com
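
## Appendix: Evaluation metric sketch

The Evaluation section reports a Spearman correlation between embedding similarity and gold labels. The sketch below shows one way such a score could be computed. It is not the authors' evaluation script: it assumes cosine similarity as the similarity measure, the function names (`cosine_similarity`, `average_ranks`, `spearman`, `evaluate_pairs`) are hypothetical, and the toy vectors stand in for embeddings that would normally come from `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks of the values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_pairs(embeddings_a, embeddings_b, gold_scores):
    """Spearman correlation between per-pair cosine similarity and gold scores."""
    sims = [cosine_similarity(a, b) for a, b in zip(embeddings_a, embeddings_b)]
    return spearman(sims, gold_scores)


# Toy stand-ins for sentence embeddings (assumption: real ones come from the model).
toy_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_b = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
toy_gold = [5.0, 3.0, 0.0]  # hypothetical human similarity labels

print(evaluate_pairs(toy_a, toy_b, toy_gold))
```

Because Spearman compares only rank orderings, the score depends on whether more similar pairs receive higher cosine similarity, not on the absolute similarity values.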