HPR3596: Extracting text, tables and images from docx files using Python

Hacker Public Radio

English - May 16, 2022 00:00 - 4.57 MB

Tools to extract data from docx files:

docx2txt
python-docx2txt
python-docx

Code Snippets
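import docx2txt

# src is the path to the .docx file; img_dest is the directory for extracted images
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)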
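
For a rough idea of what these tools are reading: a .docx file is a ZIP archive, with the main text in word/document.xml and embedded images under word/media/. The sketch below uses only the Python standard library. The function names `extract_paragraphs` and `extract_images` are made up for illustration, and this is not meant to reproduce how docx2txt works internally or the exact text it returns.

```python
import zipfile
from pathlib import Path
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_path):
    """Return the text of every paragraph (w:p) in word/document.xml,
    including paragraphs that sit inside table cells."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more w:t elements
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{W_NS}t")))
    return paragraphs


def extract_images(docx_path, img_dest):
    """Copy every file under word/media/ into img_dest; return the saved paths."""
    dest = Path(img_dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = dest / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

In practice the libraries listed above handle many details this sketch ignores (headers, footers, hyperlinks, tabs and line breaks), so it is best read as an explanation of the file layout rather than a replacement for them.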

import docx

document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

import csv

# Write each extracted table (a list of rows) from data to its own CSV file
for i, table in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table)
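
The CSV step can also be wrapped in a small reusable function. This is a minimal sketch that assumes the nested-list shape built in the table snippet (a list of tables, each a list of rows, each a list of cell strings). The name `write_tables_to_csv` and the `out_dir` parameter are hypothetical and not part of any of the libraries mentioned.

```python
import csv
from pathlib import Path


def write_tables_to_csv(data, out_dir="."):
    """Write each table in data to <index>.csv inside out_dir; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(data):
        path = out / f"{i}.csv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(table)
        paths.append(path)
    return paths
```

Calling `write_tables_to_csv(data)` after the table loop would write 0.csv, 1.csv and so on to the current directory. The csv module quotes cells that contain commas or line breaks, so cell text is preserved when the file is read back with `csv.reader`.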