HPR3596: Extracting text, tables and images from docx files using Python
Hacker Public Radio
English - May 16, 2022
Tools to extract data from docx files:
docx2txt
python-docx2txt
python-docx
Code Snippets
import docx2txt
text = docx2txt.process(src, img_dest)  # src: path to the .docx file; img_dest: directory for extracted images
with open("data.txt", "wt") as f:
    f.write(text)
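For context on the img_dest argument: a .docx file is a ZIP archive, and embedded images normally sit under word/media/ inside it. The sketch below is not docx2txt's own code; it uses only the standard library's zipfile module to copy those entries out. The function name extract_images and the in-memory stand-in archive are hypothetical and exist only for illustration.

```python
import io
import os
import tempfile
import zipfile


def extract_images(docx_file, img_dest):
    """Copy every entry under word/media/ from a .docx archive into img_dest.

    docx_file may be a path or a binary file-like object.
    Returns the sorted list of file names written.
    """
    os.makedirs(img_dest, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # Skip anything that is not an embedded media file
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            base = os.path.basename(name)
            with open(os.path.join(img_dest, base), "wb") as out:
                out.write(archive.read(name))
            written.append(base)
    return sorted(written)


# Demo on a stand-in archive that mimics the layout of a .docx file
# (hypothetical content, not a valid Word document).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("word/document.xml", "<w:document/>")
    z.writestr("word/media/image1.png", b"fake image bytes")
demo.seek(0)

out_dir = tempfile.mkdtemp()
images = extract_images(demo, out_dir)
print(images)
```

With a real document, passing its path as the first argument should behave the same way, assuming the file keeps its images in the usual word/media/ folder.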
import docx
document = docx.Document(src)
tables = document.tables
data = []  # one entry per table: a list of rows, each row a list of cell strings
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)
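import csv
for i, table in enumerate(data):  # iterate over the extracted rows, not the docx table objects
    # newline="" keeps the csv module from writing blank lines between rows on Windows
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table)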
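To try the CSV step without a Word document at hand, the sketch below runs the same kind of loop on hand-made data shaped like the data list built in the table snippet (one list of rows per table). The sample values, the write_tables helper name and the temporary output directory are assumptions for illustration only.

```python
import csv
import os
import tempfile


def write_tables(data, out_dir):
    """Write each table in data to out_dir/<index>.csv and return the paths."""
    paths = []
    for i, table in enumerate(data):
        path = os.path.join(out_dir, f"{i}.csv")
        # newline="" lets the csv module control line endings itself
        with open(path, "wt", newline="") as f:
            csv.writer(f).writerows(table)
        paths.append(path)
    return paths


# Hypothetical sample: two tables, each a list of rows of cell strings
sample = [
    [["A", "B"], ["1", "2"]],
    [["Tool", "Purpose"], ["python-docx", "tables"], ["docx2txt", "text, images"]],
]

out_dir = tempfile.mkdtemp()
paths = write_tables(sample, out_dir)

# Read the second file back to confirm the round trip
with open(paths[1], newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Cells that contain commas, such as "text, images" in the sample, are quoted by csv.writer and come back intact through csv.reader.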