Commit 23dc508 · committed by 1-800-BAD-CODE
1 Parent(s): c224a7e
make model card simpler
README.md
@@ -69,8 +69,17 @@ and detect sentence boundaries (full stops) in 47 languages.
 
 # Usage
 
+If you want to just play with the model, the widget on this page will suffice. To use the model offline,
+the following snippets show how to use the model both with a wrapper (that I wrote, available from `PyPI`)
+and manual usage (using the ONNX and SentencePiece models in this repo).
+
 ## Usage via `punctuators` package
 
+
+<details>
+
+<summary>Click to see usage with wrappers</summary>
+
 The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 
 ```bash
@@ -180,6 +189,7 @@ Outputs:
 
 </details>
 
+</details>
 
 ## Manual Usage
 If you want to use the ONNX and SP models without wrappers, see the following example.
@@ -305,11 +315,16 @@ Outputs:
 
 
 # Model Architecture
+
 This model implements the following graph, which allows punctuation, true-casing, and fullstop prediction
 in every language without language-specific behavior:
 
 ![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)
 
+<details>
+
+<summary>Click to see graph explanations</summary>
+
 We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
 
 Then we predict punctuation before and after every subtoken.
@@ -330,8 +345,14 @@ modeled as a multi-label problem. This allows for upper-casing arbitrary charact
 
 Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
 
+</details>
+
 ## Tokenizer
 
+<details>
+
+<summary>Click to see how the XLM-Roberta tokenizer was un-hacked</summary>
+
 Instead of the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to correctly encode
 the text. Per HF's comments,
 
@@ -373,6 +394,7 @@ with open("/path/to/new/sp.model", "wb") as f:
 
 Now we can use just the SP model without a wrapper.
 
+</details>
 
 ## Post-Punctuation Tokens
 This model predicts the following set of punctuation tokens after each subtoken: