A frontend to punctuation prediction for Icelandic text
Project description
README
Punctuation Prediction
A python package that punctuates Icelandic text. The input data is unpunctuated text and punctuated text is returned. The user can choose between two punctuation models, a bidirectional RNN (Punctuator 2) in Tensorflow 2, and a pretrained ELECTRA Transformer, fine-tuned for punctuation prediction, based on a Hugging Face NER recipe. The pretrained ELECTRA model was trained by Jón Friðrik Daðason on data from the Icelandic Gigaword corpus. Both punctuation models are trained/fine-tuned using Gigaword corpus data.
Table of Contents
Installation
To install, first create a conda environment:
conda create --name {venv}
Then activate it:
conda activate {venv}
Install the requirement(s):
conda install tensorflow==2.1.0
pip install tensorflow==2.1.0
could also work, if you run into problems.
The transformer models are created with PyTorch, which is needed when reading the models:
conda install pytorch
To use the ELECTRA model one needs access to Hugging Face functions, installed with:
pip install transformers
Then finally, run:
pip install punctuator-isl
Running
The program can be run either from a command line or from inside a python script.
To run it on a command line:
$ punctuate input.txt output.txt
The default model is the biRNN model, you can also specify another model, e.g.:
$ punctuate input.txt output.txt --electra
The input uses stdin
and the output stdout
. Both files are encoded in UTF-8.
Empty lines in the input are treated as sentence boundaries.
Which of the two models to be used can be specified on the command line. The default is biRNN
.
Model | Description |
---|---|
biRNN | The Punctuator 2 model in Tensorflow. |
ELECTRA | The ELECTRA Transformer (HuggingFace) |
For a short help message of how to use the package, type punctuate -h
or punctuate --help
.
The input text should be like directly from automatic speech recognition, without capitalizations or punctuations.
The first time the program is run the punctuation models are downloaded into punctuation_models
, in the user
home directory for Linux and Mac users, and in APPDATA for Windows users. The user can put a new model path in
path_config.json
inside the punctuator directory is another location is desired.
Example
In this case, the default model is used. An input string is specified and the punctuate function returns a punctuated string, words that appear after an end-of-sentence punctuation mark are capitalized.
$ echo "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu" | punctuate
$ Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.
Python module
The punctuate function
from punctuator-is import punctuate
# A string to be punctuated
s = "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu"
punctuated = punctuate(s, model='biRNN')
print(punctuated)
The program should output:
Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.
License
This code is licensed under the MIT license.
Authors/Credit
Main authors: Helga Svala Sigurðardóttir - helgas@ru.is, Inga Rún Helgadóttir - ingarun@ru.is
Acknowledgements
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for punctuator-lvl9-inga-1.1.29.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4b1c6737b4514e20f6e646eb64ca10f86c0586615bd5aa7eff7d79c4b28ae7ea |
|
MD5 | 34ba265fa8a5ca3f4a4e1aebf5c35a71 |
|
BLAKE2b-256 | 4d36cfe9c6a47c3aa9906d64da5d45918e66c57353af410a886b7ebf6a6ab108 |
Hashes for punctuator_lvl9_inga-1.1.29-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f5d531a875206d356d13d6b86aa861d3850ba966c70a892122a359014d87d8ea |
|
MD5 | 83d7cf5c2ee420a3c8c940f6dd2ce876 |
|
BLAKE2b-256 | 7abe4b354dd44da3f162017b8df393495785da2fc146ddfb66e9f4cefda27afb |