A guide to reading a Word file using Python

General Programming 2018-06-25

This tutorial shows how to read a Word file using Python. Word is widely used for documentation. The tutorial also shows how to install the docx and nltk modules on Windows when they are not already available in Python. The docx module reads the Word (.docx) file, and nltk splits the extracted text into sentences. Microsoft Word is used in many areas, some of which are listed below:

You can create all types of official documents in Microsoft Word.

You can create lecture scripts using text, WordArt, shapes, colors, and images.

You can create a birthday card or invitation card in Microsoft Word by using predefined templates or the functions in the Insert and Format menus.

You can highlight basic and advanced knowledge of MS Word as a valuable skill on your resume for a job interview.

You can create notes and assignments in MS Word.

You can create and print a book using MS Word by creating a cover page, content, headers and footers, image adjustments, text alignment, text highlighting, etc.

Whether you run your business online or offline, you need to create documents for official work.

You can use Microsoft Word to collaborate with your team while working on the same project and document.

What’s more, this software is widely used in many application fields all over the world, including data science.

You may like to read:

Read excel file using Python

Write excel file using Python

We have seen various operations on Word files using the Apache POI API in Java, which requires more lines of code to read from or write to Word files.

Reading a Word file using Python, by contrast, is very easy and takes only a few lines of code. We will use a sample Word file for the example.

You may also find a sample Word file through a Google search and give it a try.

Let’s move on to the example…

Prerequisites

Have Python installed on Windows (or Unix)

Python version and packages

Here I am using Python 3.6.5

Packages: docx, nltk

Preparing your workspace

Preparing your workspace is one of the first things you can do to make sure that you start off well. The first step is to check your working directory.

When you are working in the Python terminal, first navigate to the directory where your file is located and then start Python, i.e., make sure that your file is located in the directory you want to work from.
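The working directory can also be checked from inside Python. This is a minimal sketch using only the standard library; the helper name find_in_cwd is hypothetical, and sample.docx is the file name used by the script later in this tutorial.

```python
from pathlib import Path


def find_in_cwd(filename):
    """Return the absolute path of filename in the current working directory, or None."""
    candidate = Path.cwd() / filename
    return candidate.resolve() if candidate.is_file() else None


# Show where Python is running and whether the sample file is there
print("Working directory:", Path.cwd())
print("sample.docx found:", find_in_cwd("sample.docx") is not None)
```

If the file is not found, either move it into the printed directory or start Python from the directory that contains it.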

Check Required Modules

Check for the docx and nltk modules in the Python terminal by typing the commands shown below. If you do not get an error message, the module exists; otherwise you have to install the missing module.

>>> import docx

>>> import nltk
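The same check can be done without triggering an error message. This is a small sketch that relies on importlib.util.find_spec from the standard library; the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# An empty list means both modules are available
print(missing_modules(["docx", "nltk"]))
```

Any name printed in the list still needs to be installed.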

If the docx and nltk modules are not available, follow the steps below to install them on Windows.

1. Make sure you open the command prompt (cmd) in administrator mode.

2. Execute the below command to install the docx module.

Now we will see how to install the nltk module.

1. Execute the below command to install the nltk module. Make sure you open the command prompt in administrator mode.

2. Installing nltk as shown above is not enough; you also need to download the required packages. Download them using the below command.

3. A popup window will now open for downloading the required packages.

4. Once the required packages are downloaded, you should see the following screen.

You are done installing nltk.
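The install and download steps above can also be driven from Python. This is a minimal sketch under two assumptions: that the docx module comes from the PyPI package python-docx, and that the sentence tokenizer needs the NLTK data package punkt (recent NLTK releases may ask for punkt_tab instead). The helper names pip_install_command and install_and_download are hypothetical, and nothing is installed until install_and_download() is called.

```python
import subprocess
import sys


def pip_install_command(*packages):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", *packages]


def install_and_download():
    """Install both packages, then fetch the tokenizer data (needs network access)."""
    # Assumption: "python-docx" is the PyPI package that provides "import docx"
    subprocess.check_call(pip_install_command("python-docx", "nltk"))
    import nltk  # imported here because it may not exist before the install

    # Assumption: "punkt" is the data package used by sent_tokenize;
    # calling nltk.download() with no argument opens the interactive downloader instead
    nltk.download("punkt")


# Show the command without running it
print(" ".join(pip_install_command("python-docx", "nltk")))
```

Running pip through sys.executable installs into the same Python that will later run the script, which avoids confusion when several Python versions are installed.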

Now let’s move on to the example of reading a Word file using Python.

In the below image, I have opened a command prompt and navigated to the directory that contains the Word file to be read.

We will read the below Word file using Python. We will read the whole content of the Word file and display it in the Python console. You may instead read the content and use it for whatever your business requires.

The above Word file should be put into the C:\py_scripts directory, where we will also put the Python script that reads it.

Now create a Python script read_word.py under C:\py_scripts for reading the above Word file. Here .py is the extension of a Python file.

In the below Python script, notice how we import the docx and nltk modules.

The below Python script shows how to read a Word file using Python.

import docx
# NLTK is used for sentence tokenizing
from nltk.tokenize import sent_tokenize


# Extract text from a DOCX file
def getDocxContent(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join with newlines so adjacent paragraphs do not run together
    return "\n".join(fullText)


resume = getDocxContent("sample.docx")

# Print each sentence followed by a blank line
sentences = sent_tokenize(resume)
for sentence in sentences:
    print(sentence)
    print("\n")
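The script above builds one string from the whole document before tokenizing. A variation is to tokenize paragraph by paragraph, so that a sentence never spans two paragraphs and empty paragraphs are skipped. This is only a sketch: the helper name sentences_by_paragraph is hypothetical, and the demo uses stand-in paragraph objects and a trivial tokenizer so that it runs without a .docx file or NLTK.

```python
from types import SimpleNamespace


def sentences_by_paragraph(paragraphs, tokenize):
    """Yield sentences one paragraph at a time, skipping blank paragraphs."""
    for para in paragraphs:
        text = para.text.strip()
        if text:
            yield from tokenize(text)


# Stand-in paragraphs: any object with a .text attribute works
demo_paragraphs = [
    SimpleNamespace(text="First paragraph."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second paragraph."),
]
# Trivial tokenizer for the demo: treat each paragraph as a single sentence
demo_sentences = list(sentences_by_paragraph(demo_paragraphs, lambda text: [text]))
print(demo_sentences)

# With the real libraries (assumes docx, nltk and the tokenizer data are installed):
#   import docx
#   from nltk.tokenize import sent_tokenize
#   doc = docx.Document("sample.docx")
#   for sentence in sentences_by_paragraph(doc.paragraphs, sent_tokenize):
#       print(sentence)
```

Passing the tokenizer in as an argument keeps the helper independent of NLTK, so a different sentence splitter can be swapped in later.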

When you execute the above Python script, you should see the following output in the console.

Here is the sample file.

I hope you understood how to read a Word file using Python.

Thanks for reading.

Content by: Roy Tutorials (source link). Thank you for your support!
