1-800-BAD-CODE committed on
Commit c224a7e · 1 Parent(s): 7fed67b

add manual usage example

Files changed (1): README.md (+127 -0)

README.md CHANGED
@@ -69,6 +69,8 @@ and detect sentence boundaries (full stops) in 47 languages.
 
  # Usage
 
+ ## Usage via `punctuators` package
+
  The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 
  ```bash
 
@@ -178,6 +180,130 @@ Outputs:
 
  </details>
 
+
+ ## Manual Usage
+ If you want to use the ONNX and SP models without wrappers, see the following example.
+
+ <details>
+
+ <summary>Click to see manual usage</summary>
+
+ ```python
+ from typing import List
+
+ import numpy as np
+ import onnxruntime as ort
+ from huggingface_hub import hf_hub_download
+ from omegaconf import OmegaConf
+ from sentencepiece import SentencePieceProcessor
+
+ # Download the models from the HF hub. Note: to clean up, you can find these files in your HF cache directory.
+ spe_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="sp.model")
+ onnx_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="model.onnx")
+ config_path = hf_hub_download(
+     repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="config.yaml"
+ )
+
+ # Load the SP model
+ tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)  # noqa
+ # Load the ONNX graph
+ ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
+ # Load the model config with labels, etc.
+ config = OmegaConf.load(config_path)
+ # Potential classification labels before each subtoken
+ pre_labels: List[str] = config.pre_labels
+ # Potential classification labels after each subtoken
+ post_labels: List[str] = config.post_labels
+ # Special class that means "predict nothing"
+ null_token = config.get("null_token", "<NULL>")
+ # Special class that means "all chars in this subtoken end with a period", e.g., "am" -> "a.m."
+ acronym_token = config.get("acronym_token", "<ACRONYM>")
+ # Not used in this example, but if your sequence exceeds this value, you need to fold it over multiple inputs
+ max_len = config.max_length
+ # For reference only; the graph has no language-specific behavior
+ languages: List[str] = config.languages
+
+ # Encode some input text, adding BOS + EOS
+ input_text = "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad"
+ input_ids = [tokenizer.bos_id()] + tokenizer.EncodeAsIds(input_text) + [tokenizer.eos_id()]
+
+ # Create a numpy array with shape [B, T], which the graph expects as input.
+ # Note that we do not pass lengths to the graph; if you are using a batch, padding should be tokenizer.pad_id() and
+ # the graph's attention mechanisms will ignore pad_id() without requiring explicit sequence lengths.
+ input_ids_arr: np.ndarray = np.array([input_ids])
+
+ # Run the graph, get outputs for all analytics
+ pre_preds, post_preds, cap_preds, sbd_preds = ort_session.run(None, {"input_ids": input_ids_arr})
+ # Squeeze off the batch dimensions and convert to lists
+ pre_preds = pre_preds[0].tolist()
+ post_preds = post_preds[0].tolist()
+ cap_preds = cap_preds[0].tolist()
+ sbd_preds = sbd_preds[0].tolist()
+
+ # Segmented sentences
+ output_texts: List[str] = []
+ # Current sentence, which is built until we hit a sentence boundary prediction
+ current_chars: List[str] = []
+ # Iterate over the outputs, ignoring the first (BOS) and final (EOS) predictions and tokens
+ for token_idx in range(1, len(input_ids) - 1):
+     token = tokenizer.IdToPiece(input_ids[token_idx])
+     # Simple SP decoding
+     if token.startswith("▁") and current_chars:
+         current_chars.append(" ")
+     # Token-level predictions
+     pre_label = pre_labels[pre_preds[token_idx]]
+     post_label = post_labels[post_preds[token_idx]]
+     # If we predict "pre-punct", insert it before this token
+     if pre_label != null_token:
+         current_chars.append(pre_label)
+     # Iterate over each char, skipping SP's space token if present
+     char_start = 1 if token.startswith("▁") else 0
+     for token_char_idx, char in enumerate(token[char_start:], start=char_start):
+         # If this char should be capitalized, apply upper case
+         if cap_preds[token_idx][token_char_idx]:
+             char = char.upper()
+         # Append char
+         current_chars.append(char)
+         # If this is an acronym, add a period after every char (p.m., a.m., etc.)
+         if post_label == acronym_token:
+             current_chars.append(".")
+     # Maybe this subtoken ends with punctuation
+     if post_label != null_token and post_label != acronym_token:
+         current_chars.append(post_label)
+
+     # If this token is a sentence boundary, finalize the current sentence and reset
+     if sbd_preds[token_idx]:
+         output_texts.append("".join(current_chars))
+         current_chars.clear()
+
+ # Maybe push the final sentence, if the final token was not classified as a sentence boundary
+ if current_chars:
+     output_texts.append("".join(current_chars))
+
+ # Pretty print
+ print(f"Input: {input_text}")
+ print("Outputs:")
+ for text in output_texts:
+     print(f"\t{text}")
+ ```
+
+ Expected output:
+
+ ```text
+ Input: hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad
+ Outputs:
+     Hola mundo, ¿cómo estás?
+     Estamos bajo el sol y hace mucho calor.
+     Santa Coloma abre los huertos urbanos a las escuelas de la ciudad.
+ ```
+
+ </details>
+
+ &nbsp;
+
  # Model Architecture
  This model implements the following graph, which allows punctuation, true-casing, and fullstop prediction
  in every language without language-specific behavior:
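The manual example above notes that batched inputs should be padded with `tokenizer.pad_id()` to form a `[B, T]` array. A minimal, self-contained sketch of building such a batch with plain numpy (the helper name and the pad id value `0` are illustrative assumptions, not part of the model's API):

```python
import numpy as np

def batch_input_ids(sequences, pad_id):
    """Right-pad variable-length ID sequences into a [B, T] int64 array."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, : len(seq)] = seq
    return batch

# Two toy sequences of different lengths; pad_id=0 is for illustration only.
batch = batch_input_ids([[1, 5, 7, 2], [1, 9, 2]], pad_id=0)
print(batch.shape)           # (2, 4)
print(batch[1].tolist())     # [1, 9, 2, 0]
```

The padded array can then be passed as `{"input_ids": batch}` to `ort_session.run`, since the graph masks out `pad_id()` internally.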
 
@@ -735,6 +861,7 @@ seg test report:
 
  </details>
 
+ &nbsp;
 
  # Extra Stuff
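The manual example's comment on `config.max_length` says that longer sequences must be folded over multiple inputs. One way such folding could look, as a sketch under stated assumptions (non-overlapping windows, illustrative BOS/EOS ids; a real implementation would likely overlap windows and merge predictions at the seams):

```python
def fold_ids(ids, bos_id, eos_id, max_len):
    """Split a long ID sequence into windows of at most max_len,
    re-adding BOS/EOS to each window (no overlap, for illustration)."""
    body = max_len - 2  # reserve room for BOS + EOS in every window
    windows = []
    for start in range(0, len(ids), body):
        windows.append([bos_id] + ids[start : start + body] + [eos_id])
    return windows

# Fold 10 toy ids into windows of at most 6; ids 98/99 stand in for BOS/EOS.
windows = fold_ids(list(range(10)), bos_id=98, eos_id=99, max_len=6)
print(len(windows))   # 3
print(windows[0])     # [98, 0, 1, 2, 3, 99]
```

Each window is then run through the graph separately, skipping the BOS/EOS predictions of every window as the main example does for its single input.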