Azure Open Datasets - Azure AI show
Nov 9, 2023
Join us on a new episode of Azure AI Show on January 18, 2021 at 10:00 AM EDT with Stephen SIMON

ABSTRACT
Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning and readily available to Azure Databricks and Machine Learning Studio (classic).

AGENDA
In this episode of Azure AI Show, we'll cover the following:
• What are Azure Open Datasets and how can you use them?
• Create Azure Machine Learning datasets from Azure Open Datasets
• Azure Open Dataset Catalog

#azure #azureopendatasets #microsoft #datascience #machinelearning #ai #python
0:30
Hi everyone. Welcome back to the C-sharp Corner Live show. I'm your host for this show, that is, the Azure AI Show. Welcome back, everyone. We are back after maybe a little long break, right? But I'm really excited for this episode, as we're going to talk about something that is very exciting. At the same time, it's going to be very, very helpful. But before we go ahead and get started
0:59
with this show. Welcome back, everyone. I hope you are off to a wonderful new 2021. And I'm your host
1:08
Stephen Simon, and I just realized that many people don't know. I act as the regional committee
1:13
director for C-sharp Corner. And everyone who is joining us for the very first time
1:18
stay tuned for tomorrow, 12 noon. Tomorrow is 19th of January, 12 noon, eastern time
1:25
We have something really exciting coming up at C-sharp Corner. So stay tuned
1:29
on social media platforms, as we are very excited and we just cannot wait to tell you what we have been doing in the past years
1:37
So that's the update that we have tomorrow, 12 noon Eastern. And let's quickly talk about the different shows that we have on C-sharp Corner
1:48
The first one is definitely the Azure AI Show that we do at 8:30 p.m. IST and 10 a.m. Eastern
1:54
Just remember, any show that you find on C-sharp Corner, they always get streamed
2:00
at 10 a.m. Eastern. Maybe a couple of them may change sometime, but most of them are always streamed at 10 a.m. Eastern
2:08
The second one that we do on Tuesdays is a product showcase show, or we also do a coffee with pros that happens
2:15
at 7:30 p.m. IST, still at 10 a.m. Eastern. And tomorrow, we have a product showcase show that I'm going to talk about towards the end of the show,
2:24
what we are going to talk about tomorrow. But stay tuned to even tomorrow. Tomorrow is, I think
2:29
kind of a busy day for us. Then on Wednesday we do this C-sharp Corner MVP show. So,
2:37
MVPs are people who go ahead and contribute to the community. There are C-sharp Corner MVPs
2:43
We invite them to the live show. We talk about their journey and not just that. We also have a
2:49
technical demo, some slides and all that so that you can always tune in and there's always some
2:53
fresh content coming up. After that, we have this C-sharp Corner AMA. We are at
3:00
around 27 episodes now. Now we are moving to season two. So this is a show
3:06
which is definitely a very technical show. Guests from
3:13
different ecosystems come and join and talk about a particular topic
3:18
You can ask them some questions. So definitely you want to tune in for the Ask Anything show
3:23
And then on Fridays, we have a very exciting show, that's the Growth Mindset show. That is led by the founder, Mahesh Chand,
3:28
who talks about the growth mindset and why you shouldn't just focus
3:34
on coding and should also focus on your career. Having said that, for today's topic we're going
3:41
to talk about Azure Open Datasets. Okay, let me go ahead and quickly share my screen
3:47
okay okay so there's something wrong I'll do it again perfect so that's the topic for today
3:55
today we're going to talk about Azure open datasets now Before we go ahead and talk about the
4:01
okay, that's work fine. So before we go ahead and talk about the Azure Open DataSets
4:06
if you're someone who does not come from a machine learning or data science background, think of it in this way:
4:12
in a data science project, the very first thing that we usually do
4:18
is that we get the data, right? Now that data can be in image form,
4:23
it can be in PDF, it can be in Excel, it can be a text file, it can come
4:28
in any way. So what people would usually do is they would go to a particular resource
4:33
There are different resources out there. Like you can get the datasets from Kaggle, then there's
4:38
UCLI. I think I just said that wrong. Right. So there are many different places where you can go
4:44
ahead and download the data, put in your local machine or maybe if you're working on your
4:49
virtual machines, and then you can go ahead and start working on your project. So that's the
4:54
concept of data in the field of data science and machine learning
4:58
Now, when we think of this Azure Open Datasets, right? So let's move to the very first slide that we have
5:08
I know this is going to do this way. Okay, this way, perfect. Now, what is Azure open datasets
5:14
Now, think of it this way that Azure open data sets are, as the slide says
5:20
is Azure open data sets are curated public data sets. Now, what does this mean
5:24
Now, what Microsoft did is that Microsoft took all these, best datasets that were in different, different platforms, right
5:31
And they added them into one repository. Now, this repository is actually Azure Open Datasets
5:38
Now, it says curated datasets because if you go ahead and online and start searching for the datasets
5:43
there are many, many datasets. Some of them are not perfect. Some of them may not be very useful
5:48
So what the Azure team has done is they have handpicked some of the best
5:53
datasets and categorized them into different categories, right? And that's how they have put together all these datasets
6:04
Now, the slide also says they add scenario-specific features to machine learning solutions for more accurate models
6:10
What it says is what I mentioned, that you can have datasets depending upon
6:15
your use case scenarios. They can be weather-based datasets. They can be image-based datasets
6:24
And then there are many more data sets that you can actually, maybe I can just go ahead and quickly pull it up
6:31
oh and your data sets I'll just Google it okay hold on hold on yep so if I talk about this categorization
6:43
it can be the the weather data sets it can be loading loading loading
6:49
okay wait it's just loading right you can see I just googled
6:55
okay just think of it this way by the time I get all the these different categories think of it that your your data sets are categorized into
7:01
into different categories and the best part about this is that all these data sets
7:06
are in the cloud, that is, on Azure, so you need not worry. They're readily available over there; you can just go ahead and get started. And the best part is that there are no random datasets over there
7:19
There are some good datasets like weather, city safety, public holidays. And they have categorized them very nicely
7:25
So that should be very helpful. Moving ahead, you may want to know why you would use these datasets
7:33
Why would you want to use Azure Open Datasets? A couple of reasons that I can definitely give you:
7:39
they're handpicked, so they're really good, and they're categorized, so that really helps
7:43
as well. Second thing is that you can always go ahead and share the existing data that you have with the
7:48
community. For instance, if there are a thousand datasets in
7:53
Azure Open Datasets, it's not like there are going to be just a thousand datasets. If you are
7:58
someone who work in the field of data and you feel like okay this is a good data set that i have
8:03
or maybe something that I have created and I want to go ahead and submit it to the community, you can just
8:09
go ahead and do that. You can take your dataset and put it into this Azure Open Datasets community,
8:15
so that that's brilliant. So that that's one of the benefit. Second, since it's in cloud
8:21
so definitely it has great speed. You need not go ahead and download it. If you have an active
8:26
Azure subscription, you can just go ahead and just import it in your project. We're going to talk
8:32
about how you can do that later on towards the second half. And that's it. No downloading, nothing
8:39
just write one statement, import so-and-so dataset from so-and-so, and that's it. So
8:44
definitely it helps you with the with the speed too going ahead uh let's talk about the how you can
8:52
access the data that is very important now in the beginning of the show i did mention that
8:57
when you talk about the accessing of the data uh you what usually data science guys do is
9:02
if they download it and that that's how they do it in some cases there might be some APIs too
9:06
So to access the datasets, there are two ways. First, they are definitely integrated with Azure Machine Learning and they're readily
9:14
available to Azure Databricks and Azure Machine Learning Studio. Now, for someone who does not know what Azure Machine Learning and Azure Machine Learning Studio are,
9:21
think of this as a machine learning that you can do in Azure
9:26
Now, if that's what you are doing it, if that's where your project is running
9:32
then picking up Azure Open Datasets is just a piece of cake
9:36
It's already readily available over there, right? You just need to go ahead and add it into your project
9:41
So that's one way you can do it if you're already using this Azure machine learning and Azure Databricks
9:48
Now, second, if you want to go in and use it in some any third-party libraries or maybe third-party
9:53
applications, Azure Open Datasets come with APIs so that you can add them in some other products,
9:59
or it may be Power BI, it may be Azure Data Factory, or it may be some of the application
10:05
that you have built and you want to go and use it. So that's how you can do it
10:10
So first, either you can do it inside Azure Machine Learning Studio, or Azure Open Datasets
10:15
come with some beautiful APIs that you can use in some other product
10:20
So that's how you go ahead and actually access the Azure Open datasets
10:25
So welcome everyone who is joining us again. Thank you everyone for joining
10:29
Today we're talking about Azure Open datasets. And let's go ahead. If you have any questions, just let's let
10:35
let them come in the comments and i'll be very happy to go and answer them again now before we go
10:41
ahead and get into little detail about uh how to get the datasets and what and all that
10:48
let us quickly look at some of the building blocks now what you see on your screen the
10:54
towards the bottom right, it's public data sources such as NOAA, so it's a kind of data, right
11:00
just think of it that this is a datasets now this data sets gets pulled by an
11:05
Azure Data Factory or by a similar job scheduler. Now think of it this way: this data is stored in Azure, and they're using Azure Data Factory or maybe something very similar to it to fetch that data. So now when you fetch that data, you can either put it into Azure Blob or Azure Data Lake Storage, or you can put it into Cosmos DB, or you can put it into any of the databases. So what you have done is you have created a job scheduler. You have pulled the data. Now you want to go ahead and store it
11:35
Now here's something very interesting thing to note that we have pulled the data and now we are storing in our different in our own personal space, which means that you are actually replicating the data, right
11:47
You're not messing up with the original data, not just data you're replicating, you're also replicating the metadata
11:54
So that's pretty cool. You get your own version of the data that you can go ahead and start using it
11:59
Now once you have it in your storage, you can format it with open data
12:05
You can format with Azure Storage SDKs or you can also work with the REST APIs
12:11
Now these are the places where you can go ahead and start giving it to your machine
12:16
learning projects. For instance, now once you have this data over here, you can either give this data to
12:22
Azure Machine Learning Studio, Azure Databricks, virtual or local machines, or you can also,
12:30
with the help of APIs, you can integrate in some business apps or you can also add in
12:34
some data exploration tools, Excel, Power BI, Tableau, or whatever. So you get the basic idea on how this happens, right
12:43
So this is how Azure Open Datasets actually looks. The only thing you take away here is that you actually make a duplicate, or you can say
12:51
you get your own copy of your dataset, right? Store it in your own database or storage and then you go ahead and use it
12:59
You do not mess with the original data. And who wants to mess with the original data, right
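The copy-then-use idea described above can be sketched locally. This is only an illustration using temporary folders as stand-ins for the public source and for your own Blob, Data Lake or database storage; the folder names, the `replicate_dataset` helper and the sample values are all hypothetical and are not part of any Azure SDK.

```python
import json
import shutil
import tempfile
from pathlib import Path


def replicate_dataset(source_dir: Path, target_dir: Path) -> Path:
    """Copy a dataset folder (data files plus metadata) into our own storage."""
    copy_path = target_dir / source_dir.name
    shutil.copytree(source_dir, copy_path)
    return copy_path


# Stand-in for the public source: a tiny CSV and its metadata (made-up values).
root = Path(tempfile.mkdtemp())
public = root / "public" / "weather"
public.mkdir(parents=True)
(public / "data.csv").write_text("station,temp_c\nA,21.5\nB,19.0\n")
(public / "metadata.json").write_text(json.dumps({"name": "weather", "rows": 2}))

# Stand-in for our own storage account or database.
my_storage = root / "my_storage"
my_storage.mkdir()
my_copy = replicate_dataset(public, my_storage)

# Changing our copy leaves the original untouched.
with (my_copy / "data.csv").open("a") as handle:
    handle.write("C,25.1\n")

original_rows = (public / "data.csv").read_text().splitlines()
copied_rows = (my_copy / "data.csv").read_text().splitlines()
copied_meta = json.loads((my_copy / "metadata.json").read_text())
```

Both the data file and the metadata file travel together, and edits to the copy never reach the source folder, which is the point being made about not touching the original data.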
13:03
So that's how it's done with the Azure Open datasets. Now, let's go ahead and talk about creating of these data sets
13:12
Now, before I do that, maybe what I can do is, okay, never mind, we can still be here
13:20
Now, let's talk about creating of the datasets. Now, the creating of the datasets can be divided into two ways
13:28
First, you can either do within SDK. Let me just go ahead and put it up
13:32
you know i just or maybe not okay it's still running okay so either you can do it with an
13:39
SDK or you can do inside the studio now to get started with the SDK the very very first thing
13:44
that you need is you need to go ahead and have the Azure ML SDK installed. And how do you do that
13:51
yeah yeah if you want to go ahead and ask any question you can definitely do that it's okay anyone who's joining us and if you want to go ahead and ask any question feel free to ask uh
13:59
i'll be more than happy to answer if i know the answer right if i don't know i'll just say no okay so the very first thing if you want to go ahead and use
14:07
Azure ML Open Datasets with the SDK, the very first thing that you do is you need to make
14:12
sure that you have installed the Azure ML SDK. And how do you do that? You write pip install
14:18
azureml-sdk. pip install azureml-sdk, that's all you need to do. Okay, and I preferably use Visual Studio Code; that's really beneficial, right? People really love it
14:35
So, and it gives you some really cool tools too. So that's one good thing to follow
14:42
So once you have installed the Azure ML SDK, the next step is, the next step is, okay, so before
14:48
I go ahead and talk about this, Nathan is asking, is Azure related to .NET Core?
14:54
Nathan, this Azure ML SDK is more driven for Python, right? This is more driven for Python, but I do know that
15:04
if you want to go ahead and do this Azure ML thing with
15:10
.NET, there's something called ML.NET. You definitely want to explore that. Now this is a good question
15:16
I'm not sure. I think you should be able to use it in your ML.NET
15:20
Definitely I'm not sure about it. But I can tell you about this SDK, it's for Python developers
15:27
Right. So that's how I can answer that. So once you have installed the Azure ML SDK, what you want to do is you want to go ahead and install Azure ML Open Datasets
15:36
How do you do that? Simple: pip install azureml-opendatasets. That's how you do it
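After running the two install commands, a quick way to confirm they worked is to check whether the modules can be found. The sketch below assumes the packages expose the `azureml.core` and `azureml.opendatasets` modules; the `is_installed` helper is a hypothetical name introduced here for illustration.

```python
import importlib.util

# Install commands mentioned in the episode (run these in a terminal):
#   pip install azureml-sdk
#   pip install azureml-opendatasets


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError is raised when a parent package (such as `azureml`) is missing.
        return False


status = {name: is_installed(name) for name in ("azureml.core", "azureml.opendatasets")}
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

The check only looks the modules up and does not import them, so it is safe to run before anything has been installed.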
15:41
So I actually did it. And you can see it on the screen
15:45
It's going to take a little time. Right? So I don't know that that's a really long process
15:52
right so that's that's interesting so I already had it running if you go ahead and see in
15:59
the top, I did, okay, you cannot see it. Okay, it's pip install azureml-opendatasets.
16:04
There it is. That's what I did, and that's what it did install, right? So I think it's
16:08
almost done over there towards the end to yeah it's done perfect so once you have
16:16
done that now what you can do is if you have any Python file you can just write
16:21
what you need to do: you just need to go ahead and import the library, right? Import
16:29
Dataset, that is, from azureml.core import Dataset. And remember, in Azure Open
16:35
Datasets every dataset has a name, right? Every dataset has a name. So if I talk about
16:41
this example, from azureml.opendatasets we are importing
16:46
MNIST. That is actually the name of the dataset, okay
16:51
So we're importing that. Now, when you go ahead and import a datasets
16:55
there are actually two ways you can store it. The first one is you can store it as a tabular dataset,
17:01
in tabular form, or you can store it in file form. So if you want to go ahead and store your data in tabular form,
17:08
you can use the MNIST.get_tabular_dataset() function, or if you want to store it in a file manner,
17:16
you can just do that with MNIST.get_file_dataset(). So that will do it
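The two options just described can be wrapped in one small function. This is a minimal sketch, assuming `azureml-opendatasets` is installed and that `MNIST` exposes `get_tabular_dataset()` and `get_file_dataset()` as mentioned in the episode; the `load_mnist` wrapper and its `kind` parameter are hypothetical names used only here.

```python
def load_mnist(kind: str = "tabular"):
    """Return MNIST as a tabular or a file dataset; needs azureml-opendatasets."""
    if kind not in ("tabular", "file"):
        raise ValueError("kind must be 'tabular' or 'file'")
    # Imported lazily so this file still loads where the package is missing.
    from azureml.opendatasets import MNIST

    if kind == "tabular":
        return MNIST.get_tabular_dataset()
    return MNIST.get_file_dataset()
```

With the package installed and network access available, `load_mnist("tabular")` should give a tabular dataset object and `load_mnist("file")` a file dataset object.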
17:20
But in some cases, they might. be only tabler in some cases there might be only five okay so that that's how you do it with
17:29
the SDKs now moving ahead you can also do same thing with the with the studio what you can do
17:37
do is you can go to your workspace right and select the dataset tab under the assets and then
17:42
create a data sets from the drop down menu and select from open datasets okay so you can see you get
17:48
different options you get from local files from data store from web files, what you need to make sure that you go and select open datasets
17:55
Once you have done that, you now get an option to select the data, Azure Open
18:01
datasets and you can just go ahead and use this small search bar and search any
18:07
the datasets that you want, go ahead and click it, click on next and that's how you get selected
18:13
Now here in the example, we have used public holidays, but it's a pretty famous data sets
18:18
and that's how easy it is to go and select a dataset
18:22
Next you click on next button and then it tells you in that you can go ahead and give a name
18:29
a particular name. You can even filter it depending upon, depending upon a particular date and then you can also select any country region code
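The date and country filter offered in the studio can be mimicked on a few rows to show what it does. The rows below are hand-written samples, and the field names `countryRegionCode`, `holidayName` and `date` are assumptions about how the Public Holidays dataset labels its columns; `filter_holidays` is a hypothetical helper, not an SDK function.

```python
from datetime import date

# Hand-written sample rows for illustration; the real dataset has more columns.
SAMPLE_HOLIDAYS = [
    {"countryRegionCode": "US", "holidayName": "New Year's Day", "date": date(2021, 1, 1)},
    {"countryRegionCode": "US", "holidayName": "Martin Luther King Jr. Day", "date": date(2021, 1, 18)},
    {"countryRegionCode": "IN", "holidayName": "Republic Day", "date": date(2021, 1, 26)},
]


def filter_holidays(rows, start, end, country_code=None):
    """Keep rows with start <= date <= end, optionally for one country code."""
    return [
        row for row in rows
        if start <= row["date"] <= end
        and (country_code is None or row["countryRegionCode"] == country_code)
    ]


us_january = filter_holidays(SAMPLE_HOLIDAYS, date(2021, 1, 1), date(2021, 1, 31), "US")
us_names = [row["holidayName"] for row in us_january]
print(us_names)
```

Leaving `country_code` as `None` keeps every country in the date window, which corresponds to skipping the country region code field in the studio form.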
18:42
Okay, can we store any any file in datasets? I'm not sure what I'm not sure about
18:48
exactly exactly you mean but um if you're talking about your machine learning project right you can
18:56
definitely have a file i mean once you have data sets right once they're using it uh i would prefer
19:02
that you don't go ahead and add right about any any external file with it definitely you want to go
19:06
ahead and do some um you drop some columns make some changes you definitely want to go ahead and
19:13
filter your data uh but adding any file in the data set that would be interesting on what file you
19:18
are... okay, you mean a picture. Okay, so Nathan, if it's an image, I can, for
19:24
instance, as we took MNIST, right, so if there's any image recognition project
19:29
and if you want to add your own image you can definitely do that but make sure
19:33
it will remain only to you because if you if I go back and see when you pull the
19:38
dataset, right, you just store it in your own storage, in your own data. It may be
19:43
blob it may be anything so definitely once you have a copy of your data set
19:47
you can add you can definitely add any extra file that you want to do it so yes that there is yes
19:53
you can definitely do that let's come back to the topic now you can go ahead and select
19:58
now there might be a date range, right? You can definitely select from which date you
20:04
want to pick up and uh us country and code i mean that that's something you got always can figure
20:09
out okay that's it okay all right so so i think this is just the last five minutes is it was
20:15
small live show like a short live show okay so before we go ahead and talk about different stuff
20:21
what i want to do is um what i want to do is i want to go ahead and pull up the actually the the
20:36
website, right, the Azure Open... I'm going to drop this link in the comment so that you all can
20:42
follow along don't don't skip the live show right you can click you can click later on so this is the actual link for the Azure Open datasets right you can always
20:50
go ahead and check the latest updates so what I'm gonna do here is I will go to
20:58
let me find the datasets okay if I click on the weather data sets for instance
21:05
now this is a category that they have given now you can see I have my data set has been
21:10
filtered by weather and I get all these data sets as now for instance
21:15
I go ahead and maybe, no, I don't suppose I don't want to pick a data set from here, right
21:21
I want to go to the actual repository where I can find all the data sets. I click on this open data sets at the top
21:27
Now here, right? Here I have all the data sets. What I want to do
21:31
What I want to do? I want to go to maybe NYC Taxi, the green taxi trip records. I mean, let's suppose that I want to pick that dataset. Now if you go here, you can see it has the name, it has tags, it has the description, the volume and retention, and the storage location of the dataset. The dataset is stored
21:54
in the US Azure region. It has some additional information, and the best
21:58
part is that you can also have these sample notebooks where you can see, like for
22:03
instance I'll just open that in the tab we'll just go and look it then you can
22:08
to see how you can go in and use a dataset. Okay, Nathan, you're asking a question
22:13
Datasets depend upon size, like MB and gigabyte? Definitely, yes, I won't say gigabytes
22:19
I mean, these are mostly for experiment purpose, right? So they won't be in gigabytes because think of it in this way
22:27
If you take 10,000 records of an Excel file, Excel file has 10,000 records
22:33
what will be the size of the Excel file? 10 MB, 15 MB, 20 MB,
22:38
maybe 50 MB at maximum. I mean, think of a gigabyte of Excel file; that's a lot of data. So, and
22:46
that's going to cost you a lot so these are meant for for more to make sure that your project works
22:53
a little more well to give you more accuracy in your existing projects definitely you can
22:57
quote and practice so when you talk about data sets depend upon the size yes and in most of the
23:03
cases it is MB, not GB. So yeah, the answer is MB
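The back-of-the-envelope reasoning above can be written as a small calculation. The column count and bytes per cell are assumptions picked only to make the arithmetic concrete; real datasets vary widely, and the volume listed on each dataset's catalog page is the figure to rely on.

```python
def estimate_size_mb(rows: int, columns: int, avg_bytes_per_cell: float) -> float:
    """Rough uncompressed size in megabytes (1 MB = 1,000,000 bytes)."""
    return rows * columns * avg_bytes_per_cell / 1_000_000


# Assumed shape: 10,000 records, 20 columns, about 50 bytes per cell.
small = estimate_size_mb(10_000, 20, 50)
# The same shape with a thousand times more records.
large = estimate_size_mb(10_000_000, 20, 50)
print(f"{small:.1f} MB vs {large:.1f} MB")
```

Under these assumptions a 10,000-record file lands in the megabyte range, while the same layout with millions of records would already be measured in gigabytes.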
23:08
Coming back over here, we have the sample dataset, or you can say the describe function,
23:16
kind of used to see what a dataset looks like. You can see we have different columns and their headings, vendor ID, pick up date time, drop
23:25
time, passenger count and all that just goes over there. So this is the actual data sets, right
23:32
Just the sample data sets, you can actually have a view. And if I go ahead and click on columns, you see
23:39
Now, it has all the details for a particular column. For example, the doLocationId: what is the data type, how many unique values are there,
23:49
what are the values and the description. So these should be very helpful
23:54
If you want to go ahead and explore some of the really nice datasets, and these are well documented
24:00
and I think if you go ahead and decide to use any other datasets, these are the best
24:05
resource you can go and look at it and understand your your data now let's go ahead and click on
24:11
uh okay data access okay now see here's how the examples on how you can access this
24:18
for instance, here we have how you can do it in a notebook. Very simple: from azureml.opendatasets
24:24
import NycTlcGreen. So here you get an idea that if you want to go and use this
24:30
dataset, the name for this dataset is
24:35
NycTlcGreen. That's okay. Okay, Nathan, you can always go ahead and drop your comments coming in,
24:43
right? So it's okay if you have many questions, you can keep on dropping them in, right? I come live every
24:48
Monday at 8:30 p.m., every Monday at 8:30 p.m., so if you have any questions you can give a shout out to me
24:55
that's okay what's my Twitter account okay let me just go ahead and write um
25:05
That's my Twitter handle. That's my Twitter handle. You can always tweet your old questions that you have and I'll definitely make sure that I answer
25:16
all those questions during the live shows. So I hope that makes some happiness to you
25:25
So let's get back and this is the second last thing that we're going to look at today
25:30
We're almost at time. You can see here you can use it in your existing notebook: from
25:34
azureml.opendatasets, you can import the dataset, and you can definitely have a date, time, and all that stuff
25:41
So this is how you actually use it. You can also use Describe, and that's going to show you all the details
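The notebook usage described here can be sketched as follows. It assumes `azureml-opendatasets` is installed and that `NycTlcGreen` accepts `start_date` and `end_date` and offers `to_pandas_dataframe()`; the `month_window` and `load_green_taxi` helpers are hypothetical names added for illustration.

```python
import calendar
from datetime import datetime


def month_window(year: int, month: int):
    """Return (first moment, last second) of the given month as datetimes."""
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, 1), datetime(year, month, last_day, 23, 59, 59)


def load_green_taxi(start: datetime, end: datetime):
    """Load green taxi trips between start and end as a pandas DataFrame."""
    # Imported lazily so this file still loads where the package is missing.
    from azureml.opendatasets import NycTlcGreen

    return NycTlcGreen(start_date=start, end_date=end).to_pandas_dataframe()


start, end = month_window(2018, 5)
print(start, end)
```

With the package installed, calling `load_green_taxi(start, end).describe()` would then show the summary statistics mentioned above, the same way as for a DataFrame loaded from a local file. Keeping the date window small limits how much data is pulled.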
25:46
So very similar to what you actually, you would have done if you had data in your local machine
25:54
It's just the only thing that now you don't have to download it, just use a couple of statements, right
25:59
And then you have access to the entire datasets. All right. Now let's get to the important thing
26:06
What is the cost? Now, since this is such an amazing thing to go ahead and use it
26:11
you might wonder that Azure is not free. Well, Azure is free if they have free versions
26:17
Like you can go for 12 months free trial. There are amazing services that you can go out and use it
26:22
But you wonder that, okay, what is the cost for this Azure open data sets
26:29
The answer is you can use it at no additional cost. I mean, how beautiful is that
26:35
This is maintained by the community members, supported by Microsoft. You can go ahead and add all your, if you want to go ahead and add any of the datasets, you can definitely do that
26:46
And the best part, Azure Open datasets is free. It's for the community, it's free, and you can just go ahead and use it
26:55
You only have to pay for your Azure services. For instance, if you're using a virtual machine, for the storage, the networking,
27:03
and all other computation power. So you'll have to pay only for that,
27:08
but if you want to go ahead and use open datasets, it's entirely free and that's the best part
27:17
So that's all we have for today, but I did share this resource that should be very beneficial
27:24
for you, right? And I'm also gonna pull up. Where did we go
27:33
can't find it just just go to this link right this should be very helpful for you to go ahead and
27:43
explore more about Azure Open Datasets, and that was it for today
27:51
that was it for today let me go ahead and pull the screen again
28:03
that's fine uh-uh-uh okay all right so that was all for that for today
28:09
right we did talk about what is azure ml open datasets in how different ways you can go ahead and
28:14
use it with the Azure ML studio and with the SDK. We talked about the pricing. We also saw how
28:21
you can import it uh we also talked about some of the features and definitely whether it's
28:28
free or not so that was what we covered today uh follow up uh
28:33
Next Monday at 8:30 p.m. Indian time zone and 10 a.m. Eastern time zone for the next episode of the Azure AI Show. Until then, take good care of yourself. My name is Stephen Simon, and I'll see you in the next episode of the Azure AI Show
#Computer Education
#Machine Learning & Artificial Intelligence