Python 3 BeautifulSoup to Crawl Website & Generate XML Sitemap & Download it to Local Machine
142 views
Jun 3, 2025
Get the full source code of application here: https://codingshiksha.com/python/python-3-beautifulsoup-to-crawl-website-generate-xml-sitemap-download-it-to-local-machine/
Video Transcript
0:00
Hello guys, welcome to this video. In this video we will show you how to create an XML sitemap with Python. I have written a Python script that will go to a website, crawl each of its pages, and generate the sitemap. You can input any website here; as an example I will use my own site, freemediatools.com. If I run the Python script, you can see that the sitemap is created and saved as sitemap.xml, and if I open it, every page that was crawled is present inside this XML sitemap.
0:54
This is a very useful Python script, and I have put the script in the description of this video. You can see that the script has gone to this website, crawled each page, and created this XML sitemap.
1:14
For this Python script we are using a library called BeautifulSoup. If you search for the package bs4, you will find BeautifulSoup; it is a web scraping library that is very helpful for this kind of task. The install command is simple: just install the package with pip.
1:50
Now I will create this Python script step by step. First of all we need to import the necessary packages: import requests, import BeautifulSoup from bs4, import xml.etree.ElementTree as ET, and finally import urljoin and urlparse from urllib.parse.
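Based on that walkthrough, the imports look roughly like this (a sketch, assuming the requests and beautifulsoup4 packages have been installed with pip):

```python
# Install the dependencies first, e.g.: pip install requests beautifulsoup4
import requests                              # HTTP client used to download each page
from bs4 import BeautifulSoup                # HTML parsing / web scraping library
import xml.etree.ElementTree as ET           # used to build and write the sitemap XML
from urllib.parse import urljoin, urlparse   # helpers for resolving and inspecting URLs
```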
2:28
So we have imported all the necessary packages. Now, inside our main block, we specify a URL; you can provide any URL, and I am providing https://freemediatools.com. Then we call a simple custom function, create_sitemap, passing the site URL as the first argument and the name of the output file that will get created, sitemap.xml, as the second.
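A minimal sketch of that entry point; the function name create_sitemap and the arguments follow the video, but the exact layout (here a standard __main__ guard) may differ from the full source linked above:

```python
if __name__ == "__main__":
    # The site to crawl -- any URL can be supplied here
    url = "https://freemediatools.com"
    # First argument: site URL to crawl; second: output sitemap file name
    create_sitemap(url, "sitemap.xml")
```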
3:11
Now we need to define the function create_sitemap, which is responsible for creating the sitemap. It accepts the URL and the file name, which will be sitemap.xml. Inside this function we call another custom function that fetches all the links of the website, passing the URL as an argument (a skeleton is sketched below).
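As a rough skeleton, with the body filled in by the later steps of the video:

```python
def create_sitemap(url, filename="sitemap.xml"):
    # Step 1: crawl the site and collect its links (get_all_links is defined next)
    links = get_all_links(url)
    # Step 2: build the <urlset> XML structure and write it to `filename`
    # (shown in the sketches further below)
```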
3:45
Now we need to define that function, get_all_links. The primary objective of get_all_links is to fetch, or crawl, all the pages that are present on the website. For this we use the requests library (a third-party HTTP client) and its get method to open the URL. Then we initialize BeautifulSoup, the web scraping library, passing the response content as the first argument and 'html.parser' (in single quotes) as the second. So inside this function we first open the website and get the response, and after that we initialize BeautifulSoup with that response and the HTML parser.
4:49
Now we need to extract all the links. For this we create a new set and use a simple for loop. We find all the anchor tags using BeautifulSoup's find_all method, searching for the 'a' tag, and for each one we read its href attribute. In other words, we go through every link on the website one by one and store it in a link variable. Then the full URL is computed with urljoin, which combines the base URL and the link, and we check whether the full URL starts with http; if it does, we add it to the set. Finally we return all these links.
6:06
That is all this function does: it goes to the website, searches for all the anchor tags one by one, and returns all the links it finds (see the sketch below). If you need the full script, the link is given in the description.
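Putting those pieces together, a sketch of get_all_links as described; the full script linked above may differ in details:

```python
def get_all_links(url):
    # Download the page and parse the returned HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    links = set()  # a set avoids duplicate URLs
    # Find every anchor tag and read its href attribute
    for a_tag in soup.find_all('a', href=True):
        link = a_tag['href']
        # Resolve relative links against the base URL
        full_url = urljoin(url, link)
        # Keep only http/https links
        if full_url.startswith('http'):
            links.add(full_url)
    return links
```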
6:23
Now, after getting all these URLs, if I print them and run the script, you will see that it returns all the links found on the website; it has crawled the whole site using BeautifulSoup.
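For example, a quick check of the crawl before building the sitemap might look like this (a sketch mirroring what the video prints):

```python
links = get_all_links("https://freemediatools.com")
for link in links:
    print(link)
```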
6:46
Next we just need to create the sitemap, which is really easy. Using the ET module we create the root urlset element, with its xmlns attribute set to the standard sitemap schema namespace; this gives us the XML structure of the sitemap. After doing this we run a simple loop that adds the links one by one to the sitemap. The code looks a little complicated, but just copy it: for each link we add a url element under urlset, add a loc element under that, and set loc.text to the link.
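Continuing inside create_sitemap, a sketch of that XML-building step, assuming the standard sitemap namespace http://www.sitemaps.org/schemas/sitemap/0.9:

```python
    # (continuing inside create_sitemap, after links = get_all_links(url))
    # Root <urlset> element with the standard sitemap namespace
    urlset = ET.Element('urlset',
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")

    # Add one <url><loc>...</loc></url> entry per crawled link
    for link in links:
        url_element = ET.SubElement(urlset, 'url')
        loc = ET.SubElement(url_element, 'loc')
        loc.text = link
```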
7:45
After that we just need to save this XML file. For saving it we wrap the urlset element in an ET.ElementTree and call tree.write, passing the file name, the encoding, which we set to UTF-8, and a third argument, xml_declaration=True, because this is an XML file. Then we simply print that the XML sitemap has been generated so you can check it.
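And the saving step, completing the create_sitemap sketch:

```python
    # Wrap the root element in an ElementTree and write it to disk
    tree = ET.ElementTree(urlset)
    tree.write(filename, encoding='utf-8', xml_declaration=True)
    print(f"XML sitemap generated and saved as {filename}")
```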
8:23
So let me now delete the existing sitemap.xml; you can do this with any website. If I run the script again, you will see "XML sitemap generated", and our sitemap.xml file is created; it has crawled the whole website, and it is really quick.
8:48
This is a really useful script, because XML sitemaps are necessary if you want your website indexed for SEO (search engine optimization), and you can generate one really easily with this Python script. You can take any example website, and it will scrape all the pages and add them to your XML sitemap. Thank you very much for watching this video, and also check out my website, freemediatools.com, which contains thousands of tools.