How to Extract Sentences from Text Using the NLTK Python Module

The Natural Language Toolkit (NLTK) is a language and text processing module for Python. NLTK can analyze, process, and tokenize text in many different languages using its built-in library of corpora and its large pool of lexical data. Python is one of the most popular programming languages used in data science and language processing, mainly due to the versatility of the language and the availability of useful modules like NLTK. This article will explain how to extract sentences from text paragraphs using NLTK. The code in this guide has been tested with Python 3.8.2 and NLTK 3.4.5 on Ubuntu 20.04 LTS.

Installing NLTK in Linux

To install NLTK in Ubuntu, run the command below:

$ sudo apt install python3-nltk

NLTK packages are available in all major Linux distributions. Search for the keyword “NLTK” in your package manager to install them. If, for some reason, NLTK is not available in the repositories of your distribution, you can install it with the pip package manager by running the command below:

$ pip install --user -U nltk

Note that you will first have to install pip from your package manager for the above command to work. On some distributions, it may be called pip3. You can also follow detailed installation instructions available on the official website of NLTK.
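Whichever method you use, you can quickly verify the installation by printing the installed NLTK version:

$ python3 -c "import nltk; print(nltk.__version__)"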

Extracting Sentences from a Paragraph Using NLTK

For paragraphs without complex punctuation and spacing, you can use the built-in NLTK sentence tokenizer, called the “Punkt tokenizer,” which comes with a pre-trained model. You can also use your own trained data models to tokenize text into sentences. Custom-trained data models are out of the scope of this article, so the code below will use the built-in Punkt English tokenizer. To download the Punkt resource file, launch the Python interpreter and run the two statements below, then wait for the download to finish:

$ python3
>>> import nltk
>>> nltk.download('punkt')
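Alternatively, you can download the same resource without opening an interactive session by running a single shell command:

$ python3 -c "import nltk; nltk.download('punkt')"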

A paragraph from “Alice’s Adventures in Wonderland” will be used in the code sample below:

import nltk

para = '''Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was going
to happen next. First, she tried to look down and make out what she was coming to,
but it was too dark to see anything; then she looked at the sides of the well, and
noticed that they were filled with cupboards and book-shelves; here and there she
saw maps and pictures hung upon pegs. She took down a jar from one of the shelves
as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it
was empty: she did not like to drop the jar for fear of killing somebody, so managed
to put it into one of the cupboards as she fell past it.'''


# Split the paragraph into a list of sentences using the pre-trained Punkt model
tokens = nltk.sent_tokenize(para)
for t in tokens:
    print(t, "\n")

Running the above code will give you the following output:

Either the well was very deep, or she fell very slowly, for she had plenty of time as
 she went down to look about her and to wonder what was going to happen next.

First, she tried to look down and make out what she was coming to, but it was too dark
to see anything; then she looked at the sides of the well, and noticed that they were
filled with cupboards and book-shelves; here and there she saw maps and pictures hung
upon pegs.

She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE',
but to her great disappointment it was empty: she did not like to drop the jar for fear of
killing somebody, so managed to put it into one of the cupboards as she fell past it.

The built-in Punkt sentence tokenizer works well if you want to tokenize simple paragraphs. After importing the NLTK module, all you need to do is call the “sent_tokenize()” method on a large text corpus. However, the Punkt sentence tokenizer may not correctly detect sentences in a complex paragraph that contains unusual punctuation, exclamation marks, abbreviations, or repetitive symbols. There is no standard way to overcome all of these issues; you will have to write custom code that tackles them using regular expressions and string manipulation, or train your own data model instead of using the built-in Punkt model.
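As an illustration, a very basic custom splitter can be built with a regular expression. The sketch below is our own and not part of NLTK; the pattern itself is naive and will still break after abbreviations such as “Dr.”:

import re

# Naive splitter: break on '.', '!' or '?' followed by whitespace and an
# upper-case letter. Only a sketch; it still splits incorrectly after
# abbreviations such as "Dr." or "e.g.".
def naive_sent_split(text):
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

sample = "It was too dark to see anything! She looked at the well. What next?"
for s in naive_sent_split(sample):
    print(s)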

You can also try tweaking the existing Punkt model to fix incorrect tokenization by supplying some additional parameters. To do so, refer to the official Punkt tokenizer documentation on the NLTK website. To apply your own custom tweaks, a slight change to the code is required:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

para = '''Either the well was very deep, or she fell very slowly, for she had plenty
of time as she went down to look about her and to wonder what was going to happen
next. First, she tried to look down and make out what she was coming to, but it was
too dark to see anything; then she looked at the sides of the well, and noticed
that they were filled with cupboards and book-shelves; here and there she saw maps
and pictures hung upon pegs. She took down a jar from one of the shelves as she
passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was
empty: she did not like to drop the jar for fear of killing somebody, so managed to
put it into one of the cupboards as she fell past it.'''


punkt_params = PunktParameters()
# Punkt stores abbreviation types in lower case, without the trailing period
punkt_params.abbrev_types = set(['mr', 'mrs', 'llc'])
tokenizer = PunktSentenceTokenizer(punkt_params)
tokens = tokenizer.tokenize(para)

for t in tokens:
    print(t, "\n")

The code above does the same job as the “sent_tokenize()” method, but you can now define your own rules using built-in methods and pass them as arguments, as described in the documentation. For example, some abbreviations have been added to the code above; note that Punkt expects abbreviation types in lower case and without the trailing period. When a registered abbreviation is followed by a period, the tokenizer will not treat that period as the end of a sentence, whereas the default behavior is to treat a period as an indication of the end of a sentence.
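To see the effect of the abbreviation list, you can compare an untrained tokenizer that has no custom parameters against the customized one. The example text below is our own, chosen to trigger the abbreviation handling:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

text = "Mr. Smith arrived early. Mrs. Jones was late."

# Without an abbreviation list, an untrained tokenizer typically treats
# every period as a sentence boundary and splits after "Mr." and "Mrs.".
print(PunktSentenceTokenizer().tokenize(text))

# With the abbreviations registered (lower case, no trailing period),
# those periods are no longer treated as sentence boundaries.
punkt_params = PunktParameters()
punkt_params.abbrev_types = set(['mr', 'mrs'])
print(PunktSentenceTokenizer(punkt_params).tokenize(text))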

Conclusion

NLTK and its tokenization methods are quite efficient at tokenizing and processing text data. However, the pre-trained models may not work perfectly on every type of text. You may need to improve the existing models, train and supply your own, or write your own code to fix anomalies.


