Python 3 PyPDF2 Script to Extract Text From PDF Document and Save it as TXT File
Jan 9, 2025
Official Website:
https://freemediatools.com
0:00
uh hello guys welcome to this video so
0:02
in this video we will be looking at
0:04
another Library which allows you to
0:06
actually extract text from the PDF
0:08
document and in this specific tutorial
0:11
this time we will be using the PyPDF2
0:14
library this is again an open source
0:16
python Library if you just type on
0:18
Google right here
0:20
PyPDF2 so the name of the library comes
0:23
right here you will see it has its own
0:25
package right here the command is very
0:27
simple which is pip install PyPDF2 the
0:30
latest version is
0:32
3.0.1 and this is again a very famous
0:35
python library for working with PDF
0:38
documents you can just perform all the
0:40
operations with it you can split PDF
0:42
merge crop transform all these
0:45
operations you can perform one such
0:47
operation is actually extracting the
0:49
text which is written in the PDF
0:50
document so let's suppose we have this
0:52
PDF file present in the same directory
0:55
let me just open it in the browser you
0:57
will see there are three pages out there
1:00
each page is containing some text right
1:01
here this is the first page this is a
1:03
second page this is a third page so now
1:06
I will simply write a simple python
1:08
script which will actually use this
1:09
library PyPDF2 and programmatically we
1:13
will actually save all this text in a
1:15
txt file and let me just run this
1:19
python script right here so as I write
1:22
here python app.py you will see on
1:25
the left hand side a file will be
1:28
created so if I execute it you
1:31
will basically see a notification
1:33
message there text extracted and saved
1:36
to output.txt so if you open this
1:39
output.txt you will basically see all
1:41
the text has been
1:42
extracted in the txt file so in this way
1:46
you can extract all the text from the
1:49
PDF document and save it as a txt file
1:52
so now I will basically write the code
1:55
step by step so the very first thing you
1:57
should have python installed on your
1:59
machine I am using the latest version
2:01
Python 3 version and then just write
2:05
this command pip install
2:08
PyPDF2 this is the name of the package
2:10
pip install PyPDF2 so you just need to
2:14
install this package so now at the very
2:16
first line right here we will import
2:18
this package
2:20
PyPDF2 so we have imported this package
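A quick way to confirm that the import will work is to check whether the package can be found before importing it. The snippet below is only a sketch: the helper name `is_installed` is made up for illustration and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    import PyPDF2

    print("PyPDF2 version:", PyPDF2.__version__)
else:
    print("PyPDF2 is missing; install it with: pip install PyPDF2")
```

If the second message appears, running the pip command from the video in the same Python environment should fix it.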
2:24
and then we have a simple main function
2:27
of python
2:31
so many Python scripts have this main
2:33
function so here we need to provide our
2:36
PDF path so this PDF path is present in
2:40
the same directory guys you will see
2:41
sample.pdf so simply I will write
2:44
sample.pdf and then after this here
2:47
we need to provide our text
2:49
path so this will automatically be
2:52
created in the root directory as well
2:54
this file output.txt so you can just
2:57
change the name as well and then we
3:00
will Define this custom function that we
3:02
will write extract text
3:05
from PDF we will pass these two as
3:08
arguments right here PDF path and txt
3:11
path and then we will simply write a
3:14
simple notification that uh PDF text has
3:18
been
3:20
extracted and
3:22
saved that's all now we just need to
3:24
Define this function right here so in
3:26
Python we can Define functions using def
3:28
keyword extract text from PDF and here
3:34
you'll simply write the two arguments
3:37
right here which will be passed PDF path
3:39
and txt
3:42
path so right here guys what we need to
3:46
do right here we need to first of all
3:47
open the PDF file so we'll simply use
3:51
the open function right here once again
3:54
and here we will pass PDF path and we
3:56
will be using the read binary mode
4:00
so we will read the binary of the PDF
4:03
document and we will read it as this
4:05
variable PDF file this is really
4:07
important we are actually using the open
4:09
function we are actually opening the PDF
4:11
file as a read binary mode and this is
4:14
actually the variable that we will be
4:15
using inside this open block
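To see what read binary mode gives you, here is a small sketch that opens a file with "rb" and reads its first few bytes. The helper name `read_header` and the throwaway demo file are assumptions made for illustration; the demo file only carries a PDF-style signature and is not a real PDF.

```python
import os
import tempfile


def read_header(path, size=5):
    """Open the file in read binary mode and return its first bytes."""
    with open(path, "rb") as pdf_file:
        return pdf_file.read(size)


# Demo on a throwaway file that starts with the usual PDF signature
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pdf")
    with open(demo_path, "wb") as demo_file:
        demo_file.write(b"%PDF-1.4\n")
    header = read_header(demo_path)

print(header)  # b'%PDF-'
```

The value that comes back is a bytes object rather than a string, which is why the PDF has to be opened with "rb" before it is handed to the library.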
4:19
so now to read this we will simply be
4:22
using a class which is present
4:24
inside this module PyPDF2 and it
4:27
contains a PdfReader
4:32
class PdfReader class it contains
4:34
right here and here we need to pass the
4:36
name of the PDF file like this so here
4:40
we are actually using the PdfReader
4:42
class of
4:44
PyPDF2 library and we are actually
4:47
reading the PDF
4:49
file and now we just need to create the
4:51
txt file so again we will use that open
4:54
function this time we will provide write
4:57
mode because we are writing the file we are
4:59
creating the file in the root directory
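As a rough illustration of write mode, the sketch below creates a text file with "w" and UTF-8 encoding, writes a newline after each piece of text, and reads the result back. The helper name `save_lines` and the sample strings are assumptions for the demo, not code from the video.

```python
import os
import tempfile


def save_lines(txt_path, lines):
    """Create (or overwrite) txt_path and write each item followed by a newline."""
    with open(txt_path, "w", encoding="utf-8") as text_file:
        for line in lines:
            text_file.write(line)
            text_file.write("\n")


# Demo in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "output.txt")
    save_lines(demo_path, ["first page", "second page"])
    with open(demo_path, "r", encoding="utf-8") as saved_file:
        saved = saved_file.read()

print(saved)
```

Note that "w" overwrites an existing file with the same name, so rerunning the script replaces the earlier output.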
5:02
so these are different flags W stands
5:05
for write R stands for read so we are
5:08
writing the file that's why we are
5:10
providing write flag and then we need to
5:12
provide the encoding which is
5:15
utf8 by
5:17
default and then we need to give it a
5:20
variable which is text file colon and
5:24
then you need to calculate total number
5:26
of pages which are present in the input
5:28
PDF file now for calculating the total
5:31
number of pages you will see in this
5:32
file it is three pages so you need to
5:35
calculate programmatically so there is
5:37
a built-in function which is there which is the
5:39
len function and here pdf reader
5:45
this variable contains a property called
5:49
dot pages this will be a list-like
5:53
sequence of pages and here we are actually
5:56
passing it to the len function it
5:58
will calculate it
6:00
the total number of pages so now for
6:02
each page we will first of all get the
6:04
page for getting the page we will simply
6:08
say here PDF reader and we will simply
6:11
say here
6:14
Pages property and then square bracket
6:17
and then we will pass the individual
6:19
page number we will simply iterate
6:20
through all the pages which are present
6:22
in the PDF document after calculating
6:24
how many pages are there we will get
6:27
access to each individual page and
6:30
then the next step will be to actually
6:32
extract the text from the PDF page so
6:37
now to extract that PDF text right here
6:41
we have again a method present
6:44
page.extract_text this is the actual method which
6:47
will simply extract the text from the
6:50
actual page of the PDF and then we
6:53
will simply write this text simply use
6:57
the write function and write the text
7:01
and again we will simply write a new
7:03
line character so that it appears on the
7:06
new
7:07
line that's all so this completes the
7:10
script guys that's all that we need to
7:12
do inside this Python program and if you
7:15
now run this
7:16
file if I change the output name right
7:20
here of the file to result.txt so just
7:24
see in the left hand side as I run this
7:27
file python app.py
7:33
so you will see PDF text has been
7:37
extracted and saved and you will see
7:39
result.txt has been successfully created
7:42
and it has got all the text which is
7:44
written in the PDF document so you will
7:47
see all this text has been saved in a
7:49
txt file so in this way you can use the
7:52
PyPDF2 library in Python to actually
7:54
extract the text from PDF document and
7:57
save it in a txt file so thank you very
8:00
much for watching this video please hit
8:01
that like button subscribe to the channel
8:03
as well and I will be seeing you in the
8:05
next video
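Putting the steps from the video together, the script might look roughly like the sketch below. It is reconstructed from the narration and is not the exact code shown on screen: the names `extract_text_from_pdf` and `write_pages`, the split into a separate helper, the import placed inside the function, and the missing-file check are all assumptions. The PyPDF2 calls used are `PdfReader`, `.pages`, and `page.extract_text()`.

```python
import os


def write_pages(pdf_reader, text_file):
    """Write the text of every page to text_file, one newline after each page."""
    total_pages = len(pdf_reader.pages)  # number of pages in the PDF
    for page_num in range(total_pages):
        page = pdf_reader.pages[page_num]  # get the individual page
        text = page.extract_text()         # extract the text of that page
        text_file.write(text)
        text_file.write("\n")              # newline so the next page starts on a new line
    return total_pages


def extract_text_from_pdf(pdf_path, txt_path):
    # Imported here (an assumption of this sketch) so write_pages can be
    # tried on its own even where PyPDF2 is not installed.
    import PyPDF2

    # Open the PDF in read binary mode and the output in write mode
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        with open(txt_path, "w", encoding="utf-8") as text_file:
            write_pages(pdf_reader, text_file)


def main():
    pdf_path = "sample.pdf"  # input PDF in the same directory
    txt_path = "output.txt"  # created automatically; the name can be changed

    # Extra safety check, not part of the narrated script
    if not os.path.exists(pdf_path):
        print(f"{pdf_path} not found")
        return

    extract_text_from_pdf(pdf_path, txt_path)
    print("PDF text has been extracted and saved")


if __name__ == "__main__":
    main()
```

Save it as app.py next to sample.pdf and run python app.py. Scanned PDFs that contain only images will usually produce little or no text with this approach.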
