Python 3 Scrapy Library Script to Build a Spider to Crawl Website URLs and Generate an XML Sitemap File
Jun 3, 2025
Get the full source code of the application here:
Video Transcript
Hello guys, welcome to this video. In this video I will show you a new Python package, Scrapy. Scrapy is a very popular Python library: it's a crawler, a spider framework used for web scraping and web crawling, and using this library we'll be building a simple sitemap generator for our website. This is their official website; Scrapy is an open-source Python scraping library. The commands are simple. First of all, install the library with pip install scrapy. I've already installed it.
Now, to create a brand-new project, go to the command line and run scrapy startproject followed by your project name. Let's say I call it sitemap_gen, so the command is scrapy startproject sitemap_gen, with whatever project name you like. Press Enter and it will create the project inside the current directory. It also prints instructions telling you to cd into that directory and start your spider there, so I will cd into it.
If you open the project, it looks something like this. This is the basic structure of a Scrapy project: a folder named after your project, and inside it a spiders folder.
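For reference, the layout that scrapy startproject generates looks roughly like this (a sketch of the standard project template; the exact set of files may vary slightly between Scrapy versions):

    sitemap_gen/
        scrapy.cfg            # deploy/configuration file
        sitemap_gen/          # the project's Python module
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/          # your spiders go in here
                __init__.py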
Go into the spiders folder and create your own spider. Let's say I want to create a sitemap spider. You can give the file any name; I'm just calling it sitemap_spider.py. This is where you define your spider.
First of all, import the scrapy package. Then, from scrapy.spiders, import CrawlSpider and Rule; from scrapy.linkextractors, import LinkExtractor; and from xml.etree, import ElementTree.
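Put together, the imports described above look like this (aliasing ElementTree as ET is my own convention, not something shown on screen):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    import xml.etree.ElementTree as ET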
After importing all these packages, we just need to define the class, SitemapSpider, and here we pass in CrawlSpider as its base class. Inside the class, you first name your spider; I'll call it sitemap_spider. Then set the allowed domains: here you paste your domain name. I'll paste my own domain, freemediatools.com, since I need to generate a sitemap for it.
After that, we create another variable, start_urls, where you paste the same URL, this time with the https:// scheme. In allowed_domains you don't write https, but in start_urls you do.
After that, we need to write the rules to be followed while crawling the website. Right here we define rules with a Rule containing a LinkExtractor, plus a callback function, parse_item, and follow=True.
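As a sketch, the class described so far might look like this (the exact code on screen isn't fully visible, so treat this as an approximation):

    class SitemapSpider(CrawlSpider):
        name = "sitemap_spider"
        allowed_domains = ["freemediatools.com"]
        start_urls = ["https://freemediatools.com"]

        # Follow every link the extractor finds and hand each page to parse_item
        rules = (
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )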
After this, we define a function, parse_item(self, response). In it we get the crawled URL from response.url and pass it to a save_url helper on the spider, which will save the sitemap file. You can name the sitemap whatever you want; I'm just naming it sitemap.xml.
After that, we write the code that saves the sitemap: we construct the sitemap document from the URLs that have been crawled, and then call the write method to actually write the sitemap file to the appropriate location.
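The video doesn't show every line of this part, so here is a minimal sketch of what the two methods inside the SitemapSpider class could look like, assuming the crawled URLs are collected in a list and written out with ElementTree (the names save_url and crawled_urls are placeholders of mine, not confirmed from the screen):

        def parse_item(self, response):
            # Collect the URL of every crawled page, then rewrite the sitemap
            if not hasattr(self, "crawled_urls"):
                self.crawled_urls = []
            self.crawled_urls.append(response.url)
            self.save_url()

        def save_url(self):
            # Build <urlset><url><loc>...</loc></url>...</urlset> and write sitemap.xml
            urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
            for page_url in self.crawled_urls:
                url_element = ET.SubElement(urlset, "url")
                loc = ET.SubElement(url_element, "loc")
                loc.text = page_url
            ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)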
So if I run this project: the command to run it is really simple. You say scrapy crawl followed by your spider name, whatever name you gave the spider; in my case that's sitemap_spider, so the command is scrapy crawl sitemap_spider. Just press Enter, and after that the application will run and create the sitemap. You just need to make sure you use whatever name you gave the spider.
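As an aside not shown in the video, Scrapy can also run a spider from a plain Python script instead of the scrapy crawl command, using CrawlerProcess; a minimal sketch, assuming the spider lives in sitemap_gen/spiders/sitemap_spider.py:

    from scrapy.crawler import CrawlerProcess
    from sitemap_gen.spiders.sitemap_spider import SitemapSpider

    # Start a crawler process and block until the crawl finishes
    process = CrawlerProcess()
    process.crawl(SitemapSpider)
    process.start()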
So I think I made a mistake here. Now you can see it is crawling the website. In this easy way you can replace any URL: here you can see we are crawling geeksforgeeks.org, but I can replace it with my own website, freemediatools.com; there was just a typo. So if I run it again now, you can see it works. In this way you can do this very easily. Sorry, I made a mistake here: this needed freemediatools.com, that was the problem.
So now you can see it is crawling each page of my website, freemediatools.com, and constructing the sitemap using the Scrapy library. I found this library very useful, guys; it's a very good web scraping library and it can crawl anything. You can see it is crawling each page and storing it inside the sitemap file. Once it is complete, it will give you a notification that the process has finished. So you can use this library. I have put the full script in the description of the video, so if you want the full source code, the link is given in the description.
So you can see it is crawling each and every page into the sitemap file, and you can format the document. In this way you can construct the sitemap very easily using this script. Thank you very much for watching this video, and also check out my website, freemediatools.com, which contains thousands of tools.