Installing NLTK in Linux
To install NLTK in Ubuntu, run the command below:
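$ sudo apt install python3-nltk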
NLTK packages are available in all major Linux distributions; search for the keyword "NLTK" in your package manager to find and install them. If, for some reason, NLTK is not available in the repositories of your distribution, you can install it with the pip package manager by running the command below:
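$ pip install nltk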
Note that you will first have to install pip from your package manager for the above command to work. On some distributions, it may be called pip3. You can also follow detailed installation instructions available on the official website of NLTK.
Extracting Sentences from a Paragraph Using NLTK
For paragraphs without complex punctuation and spacing, you can use the built-in NLTK sentence tokenizer, called the "Punkt" tokenizer, which comes with a pre-trained model. You can also use your own trained data models to tokenize text into sentences. Custom-trained data models are outside the scope of this article, so the code below uses the built-in Punkt English tokenizer. To download the Punkt resource file, run the following three commands in succession, and wait for the download to finish:
$ python3
>>> import nltk
>>> nltk.download('punkt')
A paragraph from “Alice’s Adventures in Wonderland” will be used in the code sample below:
import nltk

para = '''Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was going
to happen next. First, she tried to look down and make out what she was coming to,
but it was too dark to see anything; then she looked at the sides of the well, and
noticed that they were filled with cupboards and book-shelves; here and there she
saw maps and pictures hung upon pegs. She took down a jar from one of the shelves
as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it
was empty: she did not like to drop the jar for fear of killing somebody, so managed
to put it into one of the cupboards as she fell past it.'''
tokens = nltk.sent_tokenize(para)
for t in tokens:
    print(t, "\n")
Running the above code will give you the following output:
Either the well was very deep, or she fell very slowly, for she had plenty of time as
she went down to look about her and to wonder what was going to happen next.
First, she tried to look down and make out what she was coming to, but it was too dark
to see anything; then she looked at the sides of the well, and noticed that they were
filled with cupboards and book-shelves; here and there she saw maps and pictures hung
upon pegs.
She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE',
but to her great disappointment it was empty: she did not like to drop the jar for fear of
killing somebody, so managed to put it into one of the cupboards as she fell past it.
The built-in Punkt sentence tokenizer works well if you want to tokenize simple paragraphs. After importing the NLTK module, all you need to do is call the "sent_tokenize()" method on a large text corpus. However, the Punkt sentence tokenizer may not correctly detect sentences in complex paragraphs that contain many punctuation marks, exclamation marks, abbreviations, or repetitive symbols. There is no standard way to overcome these issues: you will have to write custom code that tackles them using regex and string manipulation, or train your own data model instead of using the built-in Punkt model.
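As an illustration of the regex approach, the minimal sketch below collapses runs of repeated terminators before tokenizing. The tokenize_with_cleanup helper and the sample string are hypothetical, and the single substitution shown is just one of many clean-up rules real-world text might need:

import re
import nltk

def tokenize_with_cleanup(text):
    # Collapse runs of a repeated terminator ("!!!" -> "!", "???" -> "?")
    # so that Punkt sees a single, unambiguous sentence boundary.
    cleaned = re.sub(r'([.!?])\1+', r'\1', text)
    return nltk.sent_tokenize(cleaned)

print(tokenize_with_cleanup("What a fall!!! Will she ever reach the bottom???"))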
You can also try tweaking the existing Punkt model to fix incorrect tokenization by using some additional parameters. To do so, follow the official Punkt tokenizer documentation on the NLTK website. To use your own custom tweaks, a slight change to the code is required:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

para = '''Either the well was very deep, or she fell very slowly, for she had plenty
of time as she went down to look about her and to wonder what was going to happen
next. First, she tried to look down and make out what she was coming to, but it was
too dark to see anything; then she looked at the sides of the well, and noticed
that they were filled with cupboards and book-shelves; here and there she saw maps
and pictures hung upon pegs. She took down a jar from one of the shelves as she
passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was
empty: she did not like to drop the jar for fear of killing somebody, so managed to
put it into one of the cupboards as she fell past it.'''
# Abbreviations are supplied in lowercase, without the trailing period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = set(['mr', 'mrs', 'llc'])
tokenizer = PunktSentenceTokenizer(punkt_params)
tokens = tokenizer.tokenize(para)
for t in tokens:
    print(t, "\n")
The code above does the same job as the "sent_tokenize()" method, but you can now define your own rules using built-in parameters and pass them as arguments, as described in the documentation. For example, some abbreviations have been added to the code above (note that Punkt expects them in lowercase, without the trailing period). When one of these abbreviations is followed by a period, the tokenizer does not break the text into a new sentence there. The normal behavior is to treat a dot or period as an indication of the end of a sentence.
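To see the effect, here is a short, hedged example using a made-up sample sentence; it tokenizes the same text with and without the 'mr' abbreviation registered:

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = "She handed the jar to Mr. Rabbit. He hurried away."

# Default parameters: the period after "Mr." may be treated as a sentence boundary.
print(PunktSentenceTokenizer().tokenize(text))

# With 'mr' registered as an abbreviation, the split after "Mr." is suppressed.
params = PunktParameters()
params.abbrev_types = set(['mr'])
print(PunktSentenceTokenizer(params).tokenize(text))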
Conclusion
NLTK and its tokenization methods are quite efficient at tokenizing and processing text data. However, the pre-trained models may not work perfectly on every type of text. You may need to tweak the existing models, train and supply your own, or write custom code to fix anomalies.