Extract a sentence containing a word using Python... In addition to the sentenc…

General Programming 2018-02-14

There are a bunch of questions that get at extracting a particular sentence that contains a word (like "extract a sentence using python" and "Python extract sentence containing word"), and I have enough beginner experience with NLTK and SciPy to be able to do that on my own.

However, I'm getting stuck trying to extract a sentence containing a word... as well as the sentences before and after the target sentence.

For example:

"I was walking along to school the other day, when it began to rain. I reached for my umbrella, but I realized I had forgotten it at home. What could I do? I immediately ran for the nearest tree. But then I realized I couldn't stay dry with a tree without any leaves."

In this example, the target word is "could." If I wanted to extract the target sentence ("What could I do?") as well as the preceding and following sentences ("I reached for my umbrella, but I realized I had forgotten it at home." and "I immediately ran for the nearest tree."), what would be a good approach?

Assume I have each paragraph sectioned off as its own text...

for paragraph in document:
    pass  # do something with the paragraph

... is there a proper way to tackle this question? I have about 10,000 paragraphs with varying numbers of sentences around the target word (which appears in every single paragraph).

What about something like this?

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
for paragraph in document:
    paragraph_sentence_list = tokenizer.tokenize(paragraph)
    for line, sentence in enumerate(paragraph_sentence_list):
        # Note: this is a substring test, so it also matches "couldn't".
        if 'could' in sentence:

            print(sentence)

            # Index -1 would wrap around to the last sentence instead of
            # raising IndexError, so check the bounds explicitly.
            if line > 0:
                print(paragraph_sentence_list[line-1])
            else:
                print('Edge of paragraph. Beginning.')

            if line + 1 < len(paragraph_sentence_list):
                print(paragraph_sentence_list[line+1])
            else:
                print('Edge of paragraph. End.')

This breaks each paragraph into a list of sentences.

Iterating over the sentences then tests whether 'could' is in each sentence. If it is, the code prints the current sentence [line], followed by the previous one [line-1] and the next one [line+1].
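
One possible refinement, offered as a sketch rather than a definitive answer: wrap the same idea in a function that returns the sentences instead of printing them, and match the target as a whole word so that "couldn't" does not count as "could". The names split_sentences and sentences_with_context are hypothetical, and the regex splitter is only a naive stand-in for the Punkt tokenizer so that the example runs without NLTK data.

```python
import re


def split_sentences(paragraph):
    # Naive stand-in for Punkt (an assumption for this sketch):
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]


def sentences_with_context(paragraph, word, tokenize=split_sentences):
    """Return (previous, target, following) tuples for each sentence
    containing `word`; previous/following are None at paragraph edges."""
    sentences = tokenize(paragraph)
    # \b makes this a whole-word, case-insensitive match.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    results = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            previous = sentences[i - 1] if i > 0 else None
            following = sentences[i + 1] if i + 1 < len(sentences) else None
            results.append((previous, sentence, following))
    return results


paragraph = (
    "I was walking along to school the other day, when it began to rain. "
    "I reached for my umbrella, but I realized I had forgotten it at home. "
    "What could I do? I immediately ran for the nearest tree. "
    "But then I realized I couldn't stay dry with a tree without any leaves."
)

for previous, target, following in sentences_with_context(paragraph, 'could'):
    print(previous)
    print(target)
    print(following)
```

To use the Punkt tokenizer from the snippet above instead of the naive splitter, tokenizer.tokenize can be passed as the tokenize argument.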

Edited by: Hello, buddy! (source link)
