Python 3 PyMuPDF Script to Extract Text From PDF and Save it in TXT File Using Fitz Module

0:00
Hello friends welcome to this video
0:02
so in this video we will actually be
0:04
talking about a very simple python
0:07
script which allows you to actually
0:10
extract all the data all the text from
0:13
the PDF document so you will basically
0:15
see this is actually the PDF document
0:17
that we are taking for example and there
0:19
are three pages available in this PDF
0:21
document this is the first page this is
0:23
the second page and this is the third
0:24
page and all the text which is written
0:26
inside this PDF document we will try to
0:29
extract it using a very simple
0:30
python script and we will actually be
0:33
using this module which is the fitz
0:38
module and the name of the library is
0:41
PyMuPDF it's basically a python
0:46
library for working with PDF documents
0:49
and it is used for data extraction
0:51
analysis conversion and manipulation of
0:53
PDFs this is their official website you
0:56
can read the documentation to learn more
0:58
about it so the simple command is there
1:01
you just need to go to your command line
1:03
and you should have python installed on
1:04
your machine I have this version you
1:06
will see that
1:09
Python 3 the latest version of python is
1:11
installed on my machine so you just need
1:14
to go to VS Code and just open your
1:17
command line and run pip install PyMuPDF
1:22
so this is the command you
1:24
need it is not PyPDF2 it is
1:30
PyMuPDF and it gives you the fitz module
1:33
so the package that you need to install is PyMuPDF
1:36
and you import it as fitz so I have already installed
1:38
this so I will now be starting this
1:42
script here writing it step by step so
1:45
the very first thing we need to import
1:46
is
1:50
the fitz module that comes with the
1:53
PyMuPDF library at the very top you will
1:56
simply say import fitz so we have
2:00
successfully imported that and after
2:03
this we will actually be writing the
2:05
main function right here so just give it
2:07
an if condition right here and we will
2:11
have this main function inside this
2:12
python
2:14
script so here you need to give the
2:17
path of the input PDF file so here we
2:21
will simply write pdf_path
2:24
so my PDF file is present in the same
2:26
directory you will see that sample.pdf
2:28
it is present in the same directory so I
2:30
will simply give the name of the file
2:33
sample.pdf and then we need to give the
2:35
output file name here which will
2:37
actually get created automatically after
2:39
executing the script so output.txt you
2:42
can give any name of your choice so
2:44
after this we will actually write a very
2:46
basic function which is
2:49
extract_text_from_pdf so this will be a custom
2:52
function we will simply write here and
2:53
here we need to pass these two things as
2:56
arguments to this function so the very
2:58
first thing is pdf_path and then the
3:01
txt_path so now we need to define this
3:04
function right here at the very top so
3:06
in Python we can define our own
3:08
function using the def keyword so here
3:11
we can simply say def extract_text_from_pdf
3:15
so there will be two parameters which we
3:17
need to declare right here in this function
3:19
pdf_path and
3:22
txt_path so now inside this function right
3:24
here we just need to do the necessary
3:27
steps in order to actually extract the
3:30
data from the PDF document
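
The structure described up to this point can be sketched as follows. This is only an outline: the `import fitz` line from the previous step would sit at the top of the file, the body of `extract_text_from_pdf` is a placeholder that just hands back its arguments, and the `if __name__ == "__main__":` guard is an assumption about what the "if condition" refers to.

```python
def extract_text_from_pdf(pdf_path, txt_path):
    """Placeholder body: the real extraction steps are filled in later."""
    return pdf_path, txt_path


# Assumption: the "if condition" in the video is the usual main guard.
if __name__ == "__main__":
    pdf_path = "sample.pdf"   # input PDF, expected next to the script
    txt_path = "output.txt"   # output file, created when the script runs
    extract_text_from_pdf(pdf_path, txt_path)
```
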
3:34
so the very first thing we will do is
3:36
simply open the PDF document
3:40
programmatically so we can simply use
3:42
this basic function which is available
3:44
in the fitz library it has an open
3:47
function we will simply use the fitz.open
3:50
function right here to actually open the
3:52
input pdf_path so you can see that pdf_path
3:56
is actually passed as an argument
3:58
right here we are passing it as
3:59
sample.pdf so it will actually open this
4:02
file programmatically using this open
4:04
function right here so after opening
4:07
this file we now need to actually
4:10
extract all the data all the text which
4:13
is written inside the PDF
4:16
document so we will simply use
4:19
an open function again and this
4:22
time it is Python's built-in open for the txt file
4:26
right here we will actually create this
4:28
file in this root directory so for
4:31
actually reading a file or creating
4:33
a file in both cases we use the
4:36
open function so now we are actually
4:38
writing the file at txt_path so here we
4:40
need to provide a mode there is a list of
4:43
mode flags out there so w stands for write r
4:46
stands for read so here we need to
4:49
provide the write flag because we are
4:51
actually creating the text file as an
4:54
output and the third parameter we need
4:56
to give is the encoding which we set
4:59
to utf-8 because we are writing a
5:02
text file and then we need to simply say
5:05
as text_file so we are simply declaring
5:08
this variable right here as text_file
5:10
this is actually the format of the open
5:13
syntax right here so we are simply
5:15
opening or creating a text file in the
5:17
same directory and this
5:21
output.txt will get created after you
5:23
execute the script
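
A small sketch of the file-opening step on its own, using only Python's built-in `open`. The temporary folder and the sample string are assumptions made so the snippet runs anywhere; in the script the path would simply be `output.txt`. The encoding is passed explicitly because the default that `open` uses can depend on the platform.

```python
import os
import tempfile

# A temporary folder stands in for the script's directory in this sketch.
txt_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# "w" creates the file, or empties it if it already exists.
with open(txt_path, "w", encoding="utf-8") as text_file:
    text_file.write("text from a page")

# Read the file back with "r" to confirm what was written.
with open(txt_path, "r", encoding="utf-8") as check_file:
    content = check_file.read()

print(content)
```
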
5:26
so now we just need to
5:30
go over each page in the PDF document
5:33
and extract all the
5:35
text very basic step so we need to
5:38
iterate over all the pages first of all we
5:40
need to calculate how many pages are
5:41
there in the PDF document so you will
5:43
see in the input PDF document there
5:45
are three pages so it will actually
5:48
first of all calculate the total number
5:49
of pages so for calculating the total
5:52
number of pages we have a function right
5:55
here the len function and here we can
5:59
pass our document right
6:01
here so this len function will
6:04
return the actual number of pages which
6:06
are there in the PDF document so
6:09
this document that we are passing to
6:10
the len function it will actually
6:12
return the number of pages which are
6:14
present so in this case it will return
6:16
three so for all the three pages
6:19
we need to do the same step so
6:22
that's why we are using the for loop
6:24
here so the very first thing we need to do
6:26
here is to load that page so
6:28
there is this built-in
6:31
method load_page available in this library so we
6:33
will simply load this page and we need
6:35
to pass the actual page number as an
6:37
argument to this method so it will
6:39
actually programmatically load that page
6:42
in this python script and after that we
6:45
just need to
6:47
extract the text which is
6:50
present in the PDF so now to extract the
6:53
text again a built-in method is
6:55
available in this library which is
6:59
get
7:01
text get_text
7:05
so it's a very powerful library it
7:08
basically contains all these pre-built
7:10
methods that you can use so one such
7:12
method is get_text in order to get the text
7:14
in each page so after we get this text
7:17
we now need to Simply write this text in
7:20
the file we need to write the text to
7:22
the txt file so for writing it we will
7:25
simply use text_file.write
7:31
so this method is available in
7:33
Python itself because we are actually
7:36
using this variable you can see text_file
7:38
and it has a write method and
7:41
here we need to write this text that we
7:43
extracted so after
7:45
we write the text we need to simply move
7:48
to the next line so again we will write
7:51
a new line character so new line new
7:54
line so this will separate the pages
7:57
if you are going to the next page we
7:59
need to shift to a new line
8:02
so we are simply typing the new line
8:04
character twice right here this
8:06
completes the script guys so now if I
8:10
try to run the
8:12
script you will basically see in the
8:15
left hand
8:16
side if I just write here python
8:21
app.py you will see in the left hand
8:23
side output.txt will be created
8:29
so let me see I am in the
8:33
wrong directory so now let me just type
8:35
here python app.py so you will see a
8:38
notification comes right here extracted
8:40
text from PDF and in the left hand side
8:43
a new file has been created if you open
8:45
this file you will see all the text has
8:47
been successfully extracted in this file
8:49
here you will see
8:51
that so in this easy way guys you can
8:54
extract any text from a PDF document and
8:57
save it as a txt file this is really
9:00
useful in various scenarios where you need to
9:03
basically save the text from a PDF so in
9:07
those situations you can use this script
9:09
it's very simple I explained to you step by
9:12
step how we imported that module then we
9:14
defined this function we opened the PDF
9:17
document first of all using the open
9:19
function and then we actually did a
9:23
simple for loop and iterated over all the
9:27
pages which are present in the PDF
9:29
document we loaded each page we
9:31
extracted the text then we wrote the text to the
9:33
txt file and that's all so these are
9:37
all the steps which are necessary for
9:39
this application so thank you very much
9:42
for watching this video so you can try
9:44
it out with any number of pages
9:46
I have just taken an example right here
9:48
which contains three pages so if you
9:50
have a lot of pages as well
9:52
the script will still work so make sure
9:55
that your PDF file is present in the
9:57
same directory
9:59
while you're running the script right
10:01
here or just make sure that you provide
10:03
the correct path so this is actually the
10:07
fitz module of the PyMuPDF library here we
10:12
can use python code for extracting
10:15
the text from the PDF document so thank
10:17
you very much for watching this video
10:19
please hit that like button subscribe
10:20
to the channel and I will be seeing you in
10:22
the next video
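
Since the script only works when the input path is correct, a small check before opening the PDF can give a clearer message than a traceback. The helper name `resolve_pdf_path` is hypothetical and not part of the video; it is a sketch using only the standard library.

```python
from pathlib import Path


def resolve_pdf_path(pdf_path):
    """Return an absolute Path to the PDF, or raise if the file is missing."""
    path = Path(pdf_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    return path
```

The returned path could then be passed on to the extraction function in place of the raw file name.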

webninjadeveloper.com