A guide to reading a Word file using Python

General Programming 2018-06-25

This tutorial shows how to read a Word file using Python. Word is widely used for documentation. The tutorial also shows how to install the docx and nltk modules on Windows when they are not already available in Python. The example uses docx to read the Word (docx) file and nltk to split its text into sentences. Word is used in many areas, some of which are listed below:

You can create all types of official documents in Microsoft Word.

You can create lecture scripts using text, WordArt, shapes, colors, and images.

You can create a birthday card or an invitation card in Microsoft Word using predefined templates or the functions of the Insert and Format menus.

You can list basic and advanced knowledge of MS Word as a skill on your resume for a job interview.

You can create notes and assignments in MS Word.

You can create and print a book using MS Word by creating a cover page, content, headers and footers, image adjustments, text alignment, text highlighting, etc.

You can start your business online or offline. Either way, you need to create documents for official work.

You can use Microsoft Word to collaborate with your team while working on the same project and document.

What’s more, this software is widely used in many application fields all over the world, including data science.


We have seen various operations on Word files using the Apache POI API in Java, which requires more lines of code to read from or write to Word files.

Reading a Word file using Python, by contrast, takes only a few lines of code. We will use a sample Word file here.

You may also download a sample Word file found through a Google search and give it a try.

Let’s move on to the example…


Have Python installed on Windows (or Unix)

Python version and packages

Here I am using Python 3.6.5.

Packages: docx, nltk

Preparing your workspace

Preparing your workspace is one of the first things you can do to make sure that you start off well. The first step is to check your working directory.

When you are working in the Python terminal, first navigate to the directory where your file is located and then start Python, i.e., make sure that your file is located in the directory you are working from.
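The working-directory check can also be done from inside Python. The following is a minimal sketch using only the standard library; the helper name find_in_working_directory is hypothetical, and sample.docx is the file name used by the script later in this tutorial.

```python
from pathlib import Path


def find_in_working_directory(filename):
    """Return the absolute path of filename if it is in the current
    working directory, otherwise None."""
    candidate = Path.cwd() / filename
    return candidate.resolve() if candidate.is_file() else None


print("Working directory:", Path.cwd())
location = find_in_working_directory("sample.docx")
if location is None:
    print("sample.docx is not here; change directory before starting Python")
else:
    print("Found", location)
```

If the file is reported as missing, leave Python, change to the right directory in the cmd prompt, and start Python again.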

Check Required Modules

Check for the docx and nltk modules in the Python terminal by typing the import commands shown below. If you do not get an error message, the module exists; otherwise you have to install the missing module.

>>> import docx

>>> import nltk

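The same check can be scripted so that it reports missing modules instead of stopping at the first ImportError. This is a sketch under the assumption that the docx module comes from the PyPI package python-docx; the REQUIRED mapping and the missing_modules helper are illustrative names, not part of either library.

```python
import importlib.util

# Import name -> name to pass to pip. The import name "docx" is provided
# by the PyPI distribution "python-docx" (assumption stated in the text).
REQUIRED = {"docx": "python-docx", "nltk": "nltk"}


def missing_modules(required=REQUIRED):
    """Return the pip package names of modules that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_modules()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("docx and nltk are both available")
```

importlib.util.find_spec only looks the module up; it does not import it, so the check is quick and has no side effects.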
If the docx or nltk module is not available, follow the steps below to install it on Windows.

1. Make sure you open the cmd prompt in administrator mode.

2. Execute the command below to install the docx module. The module imported as docx is published on PyPI under the name python-docx.

pip install python-docx

Now we will see how to install the nltk module.

1. Execute the command below to install the nltk module. Make sure you open the cmd prompt in administrator mode.

pip install nltk

2. Installing nltk as shown above is not enough; you also need to download the required data packages. Download them by running the commands below in the Python terminal.

>>> import nltk

>>> nltk.download()

3. A popup window will now open for downloading the required packages.

4. Once the required packages are downloaded, the downloader window shows them as installed.

You are done installing nltk.
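If the popup window is not convenient (for example on a machine without a display), the sentence tokenizer data can be fetched by name instead. The sketch below is an assumption about which data the example needs: sent_tokenize relies on the punkt resource, and depending on the NLTK version the resource is named punkt or punkt_tab, so both are requested. The function name is hypothetical.

```python
def download_sentence_tokenizer_data(resources=("punkt", "punkt_tab")):
    """Download the NLTK data used by sent_tokenize.

    Returns a dict mapping each resource name to True (downloaded or
    already up to date) or False (download failed).
    """
    # Imported here so the function can be defined before nltk is installed.
    import nltk

    # nltk.download returns True on success and False on failure.
    return {name: bool(nltk.download(name, quiet=True)) for name in resources}


# Example call (needs nltk installed and a network connection):
# print(download_sentence_tokenizer_data())
```

This downloads far less than selecting everything in the popup window, which is enough when only sentence tokenizing is needed.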

Now let’s move on to the example of reading a Word file using Python.

Open a cmd prompt and navigate to the directory that contains the Word file to be read.

We will read the sample Word file using Python: the whole content of the file is read and displayed in the Python console. You may instead read the Word file content and process it for your own business needs using Python.

The Word file should be put into the C:\py_scripts directory, where we will also put the Python script that reads it.
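If no sample file is at hand, one can be generated with the same docx module. This is a minimal sketch, assuming python-docx is installed; make_sample_docx is a hypothetical helper and the paragraph texts are placeholder content, not the contents of the tutorial's sample file.

```python
def make_sample_docx(path, paragraphs):
    """Write a .docx file containing one Word paragraph per string."""
    import docx  # provided by the python-docx package

    document = docx.Document()  # new document based on the default template
    for text in paragraphs:
        document.add_paragraph(text)
    document.save(str(path))
    return path


# Example call with placeholder content:
# make_sample_docx("sample.docx",
#                  ["This is a sample file. It has two sentences.",
#                   "This is the second paragraph."])
```

Run it from the directory where the reading script will live so that both files end up side by side.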

Now create a Python script read_word.py under C:\py_scripts for reading the Word file. Here py is the extension of the Python file.

The Python script below shows how to read a Word file using Python. Notice how the docx and nltk modules are imported.

import docx
from nltk.tokenize import sent_tokenize

# Extract text from a DOCX file
def getDocxContent(filename):
    doc = docx.Document(filename)
    # Join paragraphs with newlines so the last sentence of one paragraph
    # does not run into the first sentence of the next
    return "\n".join(para.text for para in doc.paragraphs)

resume = getDocxContent("sample.docx")

# Split the text into sentences with NLTK and print each one
sentences = sent_tokenize(resume)
for sentence in sentences:
    print(sentence)

When you execute the above Python script, you should see the sentences of the Word file printed in the console.

Here is the sample file.
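To see what is being read, it can help to know that a .docx file is a zip archive whose main text lives in word/document.xml. The sketch below uses only the standard library and is an illustration, not how the docx module is implemented: it collects the text of every paragraph element, including paragraphs inside tables, and ignores tabs and line breaks. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of WordprocessingML elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(path):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        texts.append("".join(node.text or "" for node in paragraph.iter(W + "t")))
    return texts


# Example call:
# print("\n".join(docx_paragraph_texts("sample.docx")))
```

For real work the docx module remains the simpler choice; this version is only meant to show where the paragraph text comes from.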

I hope you now understand how to read a Word file using Python.

Thanks for reading.

Content by: Roy Tutorials

