Learn how to scrape Wikipedia with Python & Pandas to get the current S&P 500 members. We'll also create a historical constituents list for backtesting and algorithmic trading purposes.
If you're not interested in learning how to web scrape, you can download the files below.
Subscribe for more tutorials like this: https://bit.ly/3lLybeP
Download S&P 500 Data: https://bit.ly/3ESGVs7
0:37 Wikipedia S&P 500 Companies Page Overview
1:53 Import Pandas and Datetime
2:15 Use Pandas to Scrape Wikipedia's S&P 500 Data
3:52 Modify Current S&P 500 Members DataFrame
6:58 Correct S&P 500 Member Date Values
10:44 Modify S&P 500 Historical Adjustments DataFrame
12:09 Correct Missing S&P 500 Historical Values
13:30 Create S&P 500 Additions & Removals DataFrames
15:58 Merge Additions & Removals DataFrames
16:55 Identify Missing Historical Tickers
17:48 Merge Missing Tickers
19:59 Export S&P 500 Current & Historical Data to CSV
Video Transcript
0:00
Hello world! I'm super excited about this one. Today we're going to put our web scraping skills to the test and scrape the Wikipedia S&P 500 historical constituents list so we can build a CSV file, or a DataFrame, for future backtesting purposes. (We'll build an actual database in another video.) I'm really excited, and I hope you are too. Let's create some code.

0:25
Now, if you're not interested in web scraping, I'll provide links in the description below to the current S&P 500 members and the historical changes, which are the files we create in this video.

0:37
Keep in mind that we're grabbing this data from Wikipedia, which means we're at the mercy of the good people who update the page to provide accurate information. That said, this is the best freely available list of this information that I've been able to find. In fact, if you scroll to the bottom of the page, the references cover most of the changes. It looks like a few might be missing, but as far as I'm concerned this is the best freely available source. If you want truly accurate information, though, you'll probably need a paid source. That's something to consider.

1:15
Now, with that in mind, let's take a look at the page so we understand what we're going to be scraping. This Wikipedia page has two tables. The first holds the current component stocks; I believe there are 505 of them, in alphabetical order. The second table holds the changes, that is, the additions and removals, and it's organized by date.
1:53
Okay, perfect. So now, with an understanding of what we're going to be doing, let's go ahead and grab our imports: we'll do "import datetime as dt" and "import pandas as pd". Hit enter; got that right. Then we'll create a heading here; we'll call it "Web Scraping Using Pandas". Fine.
2:15
Next, create the url variable, which is just the URL of the Wikipedia page. Now, the nice thing about this page is that it has tables; if we look at the page you can see them. There are more robust packages out there for web scraping, but when the data is in table format we can simply use pandas: data = pd.read_html(url). Then we'll print out what we're working with, using a format string that shows the type of data along with its length.

3:01
Let's think about what we expect here: we're expecting a list of two, because we have two tables. And that's right: we get a list with a length of two. Perfect. So let's grab the head of both DataFrames. We'll do data[0] for the first DataFrame, and it looks just like the current-members table on the page, as it should. We'll do the same for the historical changes, and we can see that we've got some funky multi-index column at the top, which we'll have to deal with later, but that's okay. Perfect. So now we've got the data into pandas itself.
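The scraping step above can be sketched as follows. This is a minimal sketch: the video points pd.read_html straight at the Wikipedia URL, while here a tiny inline HTML snippet stands in for the real page so the example runs without a network connection (and note that read_html needs an HTML parser such as lxml installed).

```python
from io import StringIO

import pandas as pd

# pd.read_html parses every <table> it finds into a list of DataFrames.
# In the video: data = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
# Here, a toy two-table page stands in for the Wikipedia page.
html = StringIO("""
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>June 4, 2021</td><td>OGN</td></tr>
</table>
""")

data = pd.read_html(html)

# Same sanity check as in the video: what did we get, and how many tables?
print(f"{type(data)} with a length of {len(data)}")
```

As on the real page, each table becomes one DataFrame, so data[0] is the first table and data[1] the second.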
3:52
Let's modify the current members DataFrame into a format that meets our needs: we'll just grab that data and modify the format. We'll start by creating a new section heading; we'll call it "Get Current S&P 500 Members". Now let's get an understanding of the format we want the DataFrame in. We really only need four columns: the ticker; the name (I guess we don't strictly need it, but we want it); the date, which we definitely need; and the CIK code. That's easy to do. We'll create a new DataFrame, sp500, from data[0], which grabs that first table. Using integer location we'll take all of the rows but only the columns at positions 0, 1, 6 and 7. Then we'll rename those columns to the names we want, in lowercase: sp500.columns gets ticker, name, date and cik. Then let's print out the first five rows with sp500.head().

5:13
Okay, that looks good, except that our CIK code is actually not formatted properly. The CIK is a 10-digit code, but pandas brought it in as an integer; it didn't realize it was a string. That's okay; we can fix that. But before we work on that, let's see if there are any missing dates with sp500.isnull(). Okay, it looks like there are about 48 of them. Let's see what those look like. Filtering sp500 where the date is null is just boolean indexing: essentially, if the value is null the condition returns True and the row is returned; if it's not null it returns False and the row is skipped. So it looks like we've got a bunch of NaN values; some of these rows are simply missing their dates. It is what it is.

6:18
So how do we want to handle that? We'll just set those dates to 1900. Maybe as a community everyone can get together and let me know in the comments when these companies were actually added; I could create another spreadsheet and merge it all together, so maybe we can build the definitive S&P 500 list. But for now we'll fill in a dummy value that we know is wrong, 1900-01-01, because the S&P was created after that. Okay, let's see here.
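The column selection above can be sketched like this. A hand-built frame stands in for the scraped data[0]; the real Wikipedia table is wider, and the positions 0, 1, 6 and 7 reflect its layout at the time of the video (column order can change, so verify before relying on hard-coded positions).

```python
import pandas as pd

# Toy stand-in for data[0], mirroring the Wikipedia column layout at the
# time of the video: Symbol, Security, SEC filings, GICS Sector,
# GICS Sub-Industry, Headquarters Location, Date first added, CIK.
raw = pd.DataFrame({
    "Symbol": ["MMM", "AOS"],
    "Security": ["3M", "A. O. Smith"],
    "SEC filings": ["reports", "reports"],
    "GICS Sector": ["Industrials", "Industrials"],
    "GICS Sub-Industry": ["Industrial Conglomerates", "Building Products"],
    "Headquarters Location": ["Saint Paul, Minnesota", "Milwaukee, Wisconsin"],
    "Date first added": ["1976-08-09", "2017-07-26"],
    "CIK": [66740, 91142],
})

# All rows, but only the ticker, name, date and CIK columns, by position.
sp500 = raw.iloc[:, [0, 1, 6, 7]].copy()
sp500.columns = ["ticker", "name", "date", "cik"]

print(sp500.head())
print(sp500["date"].isnull().sum())  # how many dates are missing?
```

The .copy() avoids SettingWithCopyWarning later, since we go on to modify this slice.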
6:58
Okay, so what we'll do here is list all of the dates that don't meet our format. Our format is really supposed to be a four-digit year, dash, two-digit month, dash, two-digit day; if a date doesn't meet that, we'll display it to the screen to see what we're working with. We'll create a mask to do this (again, that's boolean indexing). On the sp500 date column we'll use the string method strip() to remove any whitespace; run it just in case, as a precautionary measure; there may not be any. Then fullmatch(), which matches against the entire string, with the pattern: four digits, dash, two digits, dash, two digits.

7:49
Now, obviously this gives us the correct rows, and we actually want to fix the NaNs too, so we want to make the NaNs False and then reverse the mask. We'll do mask.loc[mask.isnull()] = False, so wherever there is a null or NaN the mask becomes False, because a missing date certainly isn't correct. Then we just reverse our mask: mask = (mask == False). And sp500[mask] gives us all our incorrect values. So it looks like we have one weird value, for AT&T. That's okay; we're going to set all of these values to 1900.

8:42
We'll create a new DataFrame here: current = sp500.copy(), because we're modifying a slice of a DataFrame and copying prevents warnings. Then current.loc with our mask and the date column sets those rows to "1900-01-01". Now that all of our dates are fixed, we should be able to change the column into a datetime: current's date column becomes pd.to_datetime(current["date"]). We'll do the same kind of thing to fix our integer issue with the CIK: change the column to a string, then use zfill to fill in zeros where they're missing; essentially current's cik column becomes current["cik"].apply(str).str.zfill(10). Then we'll print out the first 10 rows (after a typo here) and hit enter to see how we did.

10:10
Perfect, that all looks good. Do we have a... yeah, we do have one 1900 row left, for AMD. Obviously that's not correct, because I don't believe AMD was one of the original S&P 500 members, but that's okay; we can fix this as a community. Okay, perfect.
10:44
So now, since we have our current S&P 500 constituents, the next thing we'll do is get those adjustments into a format that's much more manageable. We'll start once again by creating a new section heading for the adjustments.

11:00
Okay, now let's create that adjustments DataFrame: adjustments = data[1] to get the second element. Since I don't recall exactly what it looks like, let's print it out. So it appears that our index is just an integer index, and then we have some type of multi-level column header at the top. Let's get rid of that first. We can do that simply by replacing the current columns with new ones. We'll create a list of names: date, ticker_added, name_added, ticker_removed, name_removed and reason, and we'll just overwrite the current columns with adjustments.columns = columns. Let's print it out to make sure that worked, and it did.
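The header flattening above can be sketched like this, with a one-row toy frame standing in for data[1] (the real changes table has the same two-level header shape; the exact level names are an assumption).

```python
import pandas as pd

# Toy stand-in for data[1]: the changes table arrives with a two-level
# column header, which we flatten by overwriting it with plain names.
adjustments = pd.DataFrame(
    [["June 4, 2021", "OGN", "Organon & Co.", "HFC", "HollyFrontier", "S&P 500 spin-off"]],
    columns=pd.MultiIndex.from_tuples([
        ("Date", "Date"),
        ("Added", "Ticker"), ("Added", "Security"),
        ("Removed", "Ticker"), ("Removed", "Security"),
        ("Reason", "Reason"),
    ]),
)

# Replace the multi-level header with flat, lowercase names.
columns = ["date", "ticker_added", "name_added",
           "ticker_removed", "name_removed", "reason"]
adjustments.columns = columns

print(adjustments.head())
print(adjustments.isnull().sum())  # missing values per column
```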
12:09
Okay, so now let's check to see if we have missing values with adjustments.isnull(). It does look like there are missing values; let's see what we're working with. We'll use boolean indexing on adjustments where ticker_removed is null, which prints out the rows where that ticker is missing. Let's just look at OGN. Okay, so it looks like a ticker addition and a removal can happen right after one another, on separate rows. So we'll just assume that everything's correct, because everything's referenced.

13:04
So how do we go ahead and create this additions-and-deletions DataFrame? What we'll do is grab the date along with the rows of the DataFrame where ticker_added is not null, which will be the additions. Then we'll create another DataFrame where ticker_removed is not null, the removals. And then we'll just concatenate them, basically stacking them on top of each other. That's pretty easy to do; let's do it now.
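The null check described above can be sketched briefly. The two-row toy frame is an assumption standing in for the flattened adjustments table; it mimics the OGN/HFC pattern where an addition and a removal sit on separate rows.

```python
import pandas as pd

# Toy flattened adjustments frame: one addition row, one removal row.
adjustments = pd.DataFrame({
    "date": ["June 4, 2021", "June 4, 2021"],
    "ticker_added": ["OGN", None],
    "name_added": ["Organon & Co.", None],
    "ticker_removed": [None, "HFC"],
    "name_removed": [None, "HollyFrontier"],
    "reason": ["S&P 500 spin-off", "HollyFrontier merger"],
})

# Boolean indexing: rows where no ticker was removed (pure additions).
missing_removed = adjustments[adjustments["ticker_removed"].isnull()]
print(missing_removed)
```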
13:31
So we'll do additions = adjustments filtered where ticker_added is not null; the negation flips the condition, so again that gives us the not-null rows. Then we don't need all of the columns, so we'll keep just date, ticker_added and name_added, and that's pretty much it. Okay, then we'll rename them: additions.columns gets date, ticker and name. Basically what I'm doing now is renaming the columns in both the additions and the removals to the same names so we can stack them. But we also need to record what action we took, so we'll do additions["action"] = "added". (I mistyped "editions" there for a moment; fixed. Perfect.)

14:48
Okay, so we'll do the same thing for the removals: removals = adjustments filtered where ticker_removed is not null, and we'll keep the same kind of columns, date, ticker_removed and name_removed. Then again we'll rename those columns: removals.columns gets date, ticker and name. We can't keep them the same as they were prior, because then our concatenate won't work; they would have been treated as different columns. And before I make a mistake: removals["action"] = "removed", since we have to record the action, which is removed. Hit enter.
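The additions/removals split above can be sketched as follows, again on a toy adjustments frame (an assumption standing in for the scraped table).

```python
import pandas as pd

# Toy flattened adjustments frame.
adjustments = pd.DataFrame({
    "date": ["June 4, 2021", "June 4, 2021"],
    "ticker_added": ["OGN", None],
    "name_added": ["Organon & Co.", None],
    "ticker_removed": [None, "HFC"],
    "name_removed": [None, "HollyFrontier"],
    "reason": ["S&P 500 spin-off", "HollyFrontier merger"],
})

# Additions: rows where a ticker was added; keep only the columns we need.
additions = adjustments[~adjustments["ticker_added"].isnull()][
    ["date", "ticker_added", "name_added"]].copy()
additions.columns = ["date", "ticker", "name"]   # shared names so we can stack later
additions["action"] = "added"                    # record what happened

# Removals: same treatment on the other side of the table.
removals = adjustments[~adjustments["ticker_removed"].isnull()][
    ["date", "ticker_removed", "name_removed"]].copy()
removals.columns = ["date", "ticker", "name"]
removals["action"] = "removed"

print(additions)
print(removals)
```

Giving both frames identical column names is what lets the later concatenate line the rows up instead of creating side-by-side columns.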
15:58
Okay, perfect. So now let's create a new DataFrame, which we'll call historical, where we simply stack the two on top of each other with concatenate: historical = pd.concat(), passing a list of additions and removals. We'll print that out to confirm it works and we didn't accidentally mess something up. (After a couple of typos on "concat"... sorry about that.)
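The stacking step is one call; here is a minimal sketch with two one-row toy frames standing in for the real additions and removals.

```python
import pandas as pd

# Because both frames share the same column names, pd.concat stacks the
# rows on top of each other instead of creating new columns.
additions = pd.DataFrame({"date": ["June 4, 2021"], "ticker": ["OGN"],
                          "name": ["Organon & Co."], "action": ["added"]})
removals = pd.DataFrame({"date": ["June 4, 2021"], "ticker": ["HFC"],
                         "name": ["HollyFrontier"], "action": ["removed"]})

historical = pd.concat([additions, removals])
print(historical)
```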
16:55
All right, now what we need to do is determine whether this historical list is complete. So let's go ahead and see if there are any missing tickers. We'll do missing = the current frame we built previously, filtered where the current ticker is not in the historical tickers, and we'll copy that. Basically what we're saying is that if a current ticker does not exist among the historical tickers, then we need to flag it as missing, because they should all exist, and we create a new DataFrame out of those rows. It looks like we do have some missing, so we're going to need to fix that.
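The completeness check above can be sketched like this. Both frames are toy stand-ins; MMM's CIK is real, the other values are placeholders.

```python
import pandas as pd

# Toy current members and toy historical changes. MMM never appears in the
# changes frame, so it should come out as "missing".
current = pd.DataFrame({
    "ticker": ["MMM", "OGN"],
    "name": ["3M", "Organon & Co."],
    "date": pd.to_datetime(["1976-08-09", "2021-06-04"]),
    "cik": ["0000066740", "0000000000"],  # second CIK is a placeholder
})
historical = pd.DataFrame({
    "date": ["June 4, 2021"], "ticker": ["OGN"],
    "name": ["Organon & Co."], "action": ["added"],
})

# ~isin(): current tickers with no record at all in the historical changes.
missing = current[~current["ticker"].isin(historical["ticker"])].copy()
print(missing)
```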
17:48
So, as with every section, we'll create a new heading, only this time I'm going to create it above the cell, because I think that makes more sense: "Merge Missing Tickers". Then we'll create a new cell down here. Now let's think about the action we're going to take. Obviously, if a ticker is missing from the historical list, it was added at some point, but we don't have a record of that, so we need to create that missing action: missing["action"] = "added". Then we'll just put missing into the format we want, passing a list of date, ticker, name and action. We'll also pass the CIK code; I don't know if we'll need it, but I'll put it in there just to be complete, and you can always remove it from your file if you want. It looks like we've got all the missing data now.

18:50
Now it's pretty easy; we just simply stack it all on top of each other: sp500_history = pd.concat(), just like we saw before, passing a list of historical and missing. (I almost typed it wrong again; I'm not sure what's with me today.) Then what we'll want to do is sort the values, just to make things a little easier to read: sp500_history = sp500_history.sort_values(), with by set to a list of date and ticker. We'll also adjust the direction of the sort, with ascending set to a list of False and True. Print sp500_history, and there's our S&P 500 DataFrame.
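The merge-and-sort step above can be sketched as follows, on small toy frames. ISO date strings are used here so the string sort matches chronological order; with real parsed datetimes the same sort_values call applies.

```python
import pandas as pd

# Toy historical changes and toy missing rows.
historical = pd.DataFrame({
    "date": ["2021-06-04", "2021-06-04"],
    "ticker": ["OGN", "HFC"],
    "name": ["Organon & Co.", "HollyFrontier"],
    "action": ["added", "removed"],
})
missing = pd.DataFrame({
    "ticker": ["MMM"], "name": ["3M"],
    "date": ["1976-08-09"], "cik": ["0000066740"],
})

# Flag the missing rows as additions and align the column order.
missing["action"] = "added"
missing = missing[["date", "ticker", "name", "action", "cik"]]

# Stack, then sort: newest date first, ticker A-Z within each date.
sp500_history = pd.concat([historical, missing])
sp500_history = sp500_history.sort_values(by=["date", "ticker"],
                                          ascending=[False, True])
print(sp500_history)
```

Rows from the historical frame simply get NaN in the cik column after the concat, since only the missing frame carries it.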
19:59
Perfect. So now we've got one more step and then we're done: this is where we export everything to CSV. Create a heading, "Export to CSV". Since we might export this at different times, let's just grab today's date in string format: datetime.today().strftime(), passing it a format string with the year, month and day. Okay, got that right. Perfect, that's today's date. Now let's export each of the DataFrames into a file. We'll do current.to_csv() with a format string that puts today's date in front of "sp500.csv", and then sp500_history.to_csv() the same way, with today's date in front of "sp500_history.csv". We're just prepending the date; that's all we're doing. Okay, perfect; hit enter, and that worked. Awesome.
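The export step can be sketched like this, with one-row toy frames standing in for the real DataFrames; the "%Y-%m-%d" date format and the exact file names are assumptions, since the video doesn't spell them out in full.

```python
import datetime as dt

import pandas as pd

# Toy stand-ins for the current members and historical changes frames.
current = pd.DataFrame({"ticker": ["MMM"], "name": ["3M"],
                        "date": ["1976-08-09"], "cik": ["0000066740"]})
sp500_history = pd.DataFrame({"date": ["2021-06-04"], "ticker": ["OGN"],
                              "name": ["Organon & Co."], "action": ["added"]})

# Prepend today's date so repeated exports don't overwrite each other.
today = dt.datetime.today().strftime("%Y-%m-%d")
current.to_csv(f"{today} sp500.csv", index=False)
sp500_history.to_csv(f"{today} sp500_history.csv", index=False)
print(f"wrote {today} sp500.csv and {today} sp500_history.csv")
```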
21:29
And that's it. Hopefully you learned something along the way, and at least now you have the S&P 500 historical constituents list in both DataFrame and CSV format for future backtesting purposes. Well, that was a mouthful! But if you liked the video, please, please, please give it a thumbs up; it lets the Google algorithm know that this video is worth sharing. I hope to see you in the next one. Thanks again, bye!
#Programming
#Investing
