Data-Science Training Session

So, how was your day?

I hope it was nice! As for me, mine was pretty laborious and exciting at the same time. The thing is that I recently completed a data science project that was given to me and my team by our data science trainer, Mr. Manish Kumar Jain. He is a superb man with marvelous skills and generosity, let me tell you. And this reminds me that I haven’t written anything about my industrial training to you guys. Right? Riiiight! So let me walk you through the gist of my training sessions in this blog post. Ladies and gentlemen, fasten your seat belts for a ride through my training sessions, but only if you allow me by clicking the read more link below.

Training Day – 1

Last week my college arranged industrial training sessions for IT students, and it was my good fortune to be among the party. Theoretical classes were off for the whole week and we just had to attend the training from 9:00 to 5:00. This was a dream come true. I had always wanted to work like this in college. And so we stepped into THE FIRST DAY.

We filled up a computer laboratory, which was going to be our venue for all the coming sessions. My excitement was at a level I had never felt before. As soon as the lab had filled up completely, we met our trainer. Trainer, guide, teacher, mentor or any other respectful title for him is just a word, because the man had a speechlessly fantastic personality. I mean, I can’t explain the most … nature he possessed. See, I am not even able to find a worthy adjective!

Sir introduced himself as Mr. Manish Kumar Jain, and I am never going to forget this name in my life.

He started a conversation with us and very soon established a friendly environment. For the first time in college we were not bound to attend a theoretical class but were willing to learn something new and practical, and of this he assured us.

The first day was dedicated to introducing a few concepts and the practical tools to implement them. Let’s see what they are:

BIG-DATA:- As the name suggests, Big Data is a bulk of data that may or may not be structured, such as the databases of Facebook, WhatsApp, etc. Data that comes in huge amounts and needs to be processed is simply called Big Data. We were shown two ways of working with it:

1. Hadoop

2. DataScience

See, Big Data is a concept. How we implement it makes all the difference. So here we have two of its major implementations. Let’s discuss them one by one.

1. Hadoop

Hadoop is one of the implementations of Big Data. With it you can process big data, analyze it and, on the basis of the given data, predict future possibilities. You can do the same thing in data science as well. What makes all the difference, as it was presented to us, is that to process a bulk of data in Hadoop you need to write a bulk of code as well. In other words, working in Hadoop costs you long code and the complicated implementation that comes with it.

2. DataScience

Similar to Hadoop in objective but very different in approach. Data science is the smart approach to processing, analyzing and predicting on the basis of a dataset. The question is: how? Here is the answer. The main programming language we use in data science is a really high-level language that lets you do the same work in far less code. That language is “R”. R itself has been around since the 1990s, but it has become very popular for data science in recent years. It was the first time I had heard of the language, but as the training went on we were introduced to more of its functionality and understood that it is well worth learning at present.

Internet Of Things (I O T):-

Now this is a fancy thing. IoT simply means linking physical electronic/electrical devices to the Internet, often for automation, e.g. automatically controlling ACs according to the surrounding temperature, or the various kinds of automation in smart city projects. You will find many more examples if you Google the term.

Cloud Computing:-

Cloud is nothing but a remote server (mainly a hardware component), and cloud computing is any process that uses such servers so that more than one user can work on the same set of data in a synchronized way, e.g. a single project maintained by developers in different places with the help of the cloud. A cloud can be of various types: public, private, hybrid or community.

R – programming:-

The language is really advanced. Believe me, it is capable of doing most of the stuff currently needed in IT, from simple coding to animations to web apps to scripting to dealing with big data. You just have to install the relevant package and get started with the task.

Rstudio:-

RStudio is an IDE (integrated development environment) made specifically for the “R” programming language. Within a single window you can see the code, console, environment (the objects in memory), plots, animations, etc. It is a really worthy IDE for the R language.

Other terms like data mining and data warehousing were also discussed.

As soon as the introduction of these concepts was done, we moved towards a more practical approach to data science: learning the “R language” using RStudio and working with example datasets to polish our skills.

We downloaded RStudio as the IDE and R version 3.3.2. We opened RStudio for the first time and it was pretty handy to work with. I liked it!

Commands we were introduced to on the first day of training:-

1. data() - shows all the built-in datasets that ship with R and its loaded packages, which you can practice on

2. AirPassengers - one of the built-in datasets in R

3. class(AirPassengers) - returns the class of the object provided.

The class of an object may be data.frame, numeric, integer, character, matrix, list, ts (time series), etc.

4. View(trees) - shows the trees dataset in tabular form.

5. nrow(trees) - returns the no. of rows.

6. ncol(trees) - returns the no. of columns.

7. attributes(trees) - returns basic info about the dataset, like its column names, class and row names.

8. print(trees[,2]) - prints the second column of the trees dataset.

9. A:B - the colon operator gives the numbers from “A” to “B”

10. c(1,2,3,4,5) - the combine function joins the values into one vector, like:

A = c(1,2,3,4,5,6)

print(A) - here A is a numeric vector, and all the values given to the combine function get printed

11. new = read.csv(file.choose(),header=T) - opens a window to browse for the *.csv file to import into R.

12. help() - gives the help manual for the command you pass to it.

13. apropos() - gives all the commands related to the keyword you pass to it, e.g. apropos("ls") gives all the commands in which “ls” appears.

14. New = data.frame() - to create a new data frame

15. New = edit(New) - to edit its values

16. ls() - lists all the datasets and variables you have declared up till now.

17. rm(A) - to remove an object

18. B <- matrix(b,nrow=4,ncol=4,byrow=TRUE) - defines a matrix named B with 4 rows and 4 columns, filled row by row from “b”, which is itself a vector with 4*4 elements.

19. Some mathematical functions were:

  • sqrt()
  • abs()
  • sin()
  • cos()
  • tan()
  • pi
  • exp()
  • log()
  • gamma()
  • factorial()
  • choose() — for combination
  • length()
  • sum()
  • prod()
  • cumsum()
  • cumprod()
  • diff()
  • mean()
  • median()

Some dedicated to matrices are:

  • det() - for the determinant
  • t() - transpose
  • dim() - dimensions
  • solve() - for the inverse of a matrix, or to solve a system of linear equations
  • eigen() - for eigenvalues and eigenvectors
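To get a feel for a few of these functions, here is a small sketch I put together. The vector v and the matrix M are just made-up values for illustration, not something from the training.

```r
# A few of the listed math functions applied to a small, made-up vector
v <- c(4, 9, 16)
roots <- sqrt(v)             # square root of each element
running_total <- cumsum(v)   # cumulative sum: 4, 13, 29
steps <- diff(v)             # differences between neighbours: 5, 7
pairs <- choose(5, 2)        # number of ways to pick 2 items out of 5

# Matrix helpers on a made-up 2 x 2 matrix (filled row by row)
M <- matrix(c(2, 1, 1, 3), nrow = 2, byrow = TRUE)
d <- det(M)                  # determinant: 2*3 - 1*1
Mt <- t(M)                   # transpose
Minv <- solve(M)             # inverse of M
check <- M %*% Minv          # matrix product; should be the identity, up to rounding

print(roots)
print(d)
print(check)
```

Multiplying a matrix by its inverse with %*% is a quick way to convince yourself that solve() really returned the inverse.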

20. sort(x, decreasing = TRUE) - sorts “x” in descending order; use FALSE instead of TRUE for ascending order

21. # - comments can be given after the hash symbol.

22. seq(x,y,by=z) - gives the sequence of numbers from x to y in increments of z

23. subset(dataset, expression) - takes only those rows of the dataset for which the expression is TRUE

To refer to the Height field of the trees dataset, use trees$Height

24. R has the logical operators: ! (not), | (or), & (and)

25. For the remainder R uses “%%”, the double percent sign.
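The last few commands are easier to remember when you see them working together. This is only a minimal sketch on the built-in trees dataset; the thresholds 80 and 15 are values I picked for illustration.

```r
# Sequence from 2 to 20 in steps of 2
evens <- seq(2, 20, by = 2)

# Sort a small vector in descending order
sorted_desc <- sort(c(3, 1, 2), decreasing = TRUE)

# Rows of trees where Height is above 80 AND Girth is below 15
# (the two conditions are joined with the logical operator &)
tall_thin <- subset(trees, Height > 80 & Girth < 15)

# Remainder operator: which of the numbers 1 to 6 are odd?
is_odd <- (1:6) %% 2 == 1

print(evens)
print(sorted_desc)
print(tall_thin)
print(is_odd)
```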

With these simple commands day one was over, and I was waiting desperately for the next day’s session to learn more. But you don’t need to wait: just scroll down the page and get more of the R language.

Training day – 2

This was the day when we were introduced to a slightly higher level of “R”, I mean looping, branching, switching and all that. If you know C/C++, Java, JavaScript or any other programming language, it will be much easier for you to understand.

26. Branching: conditional programming with if-else, ifelse, switch:

if(expr1){

}else{ # note that “else” must start on the same line as the closing curly bracket “}”

}

27. The conditional operator is also here, with a little “R” syntax. It returns exp2 where exp1 is TRUE and exp3 otherwise:

ifelse(exp1,exp2,exp3)

28. Switch-case is here like this. When expr is a number n, the n-th of the following values is returned:

switch(expr, "if case is 1", "if case is 2", "and so on")
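Here is a small sketch of the three branching tools side by side. The heights, the "tall" cut-off of 170 and the colour names are hypothetical values chosen only for the example.

```r
# if / else on a single value (the "else" sits on the same line as "}")
height <- 172
if (height > 170) {
  label <- "tall"
} else {
  label <- "not tall"
}

# ifelse() works on a whole vector at once
heights <- c(150, 172, 181)
size_labels <- ifelse(heights > 170, "tall", "not tall")

# switch() with a number picks the n-th alternative
day <- switch(2, "Monday", "Tuesday", "Wednesday")

# switch() with a string matches by name; the unnamed last value is the default
colour_code <- switch("green", red = "#FF0000", green = "#00FF00", "unknown")

print(label)
print(size_labels)
print(day)
print(colour_code)
```

The main difference to keep in mind: if-else decides on one condition, while ifelse() returns one answer per element of the vector.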

29. Now let’s see looping: for loop, while loop, repeat

30. for(counter in array){

#this will repeat from counter == first value of array till counter == last value of array

}

31. while(expr){

#till expr becomes false

}

32. repeat{ } - repeats its body until a break is reached. (To repeat a value “x”, “y” times, use rep(x,y).)

33. print(a) - prints the value of “a” only. But what if we have to print multiple values? Then we use:

34. cat("the result is",a)

35. To accept input from the user at runtime we use:

x = as.integer(readline(prompt="Enter the Height: ")) - accepts the input into “x” and shows a prompt message.

36. break - statement used to break out of the loop

37. next - statement used to move to the next iteration
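To see the three loops together with break and next, here is a minimal sketch. The numbers are arbitrary and only there to make the loops do something visible.

```r
# for loop with next: collect the squares of 1 to 5, skipping 3
squares <- c()
for (i in 1:5) {
  if (i == 3) next            # jump straight to the next iteration
  squares <- c(squares, i^2)
}

# while loop: count how many times 10 can be halved before it drops to 1 or below
n <- 10
halvings <- 0
while (n > 1) {
  n <- n / 2
  halvings <- halvings + 1
}

# repeat loop: runs forever unless we break out of it
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break       # leave the loop after the third pass
}

cat("squares:", squares, "\n")
cat("halvings:", halvings, "\n")
cat("count:", count, "\n")
```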

38. jump, etc.

39. getwd() — to get the current working directory

40. setwd() — to set a new working directory

41. write.table(trees,"/home/coderunner/Public",quote=F) - exports the dataset as a table to the given file path, which you may use later.

42. out=capture.output(attributes(trees)) - to capture the output of a command, and then

cat("my title",out,file="summary of text",sep="\n",append=TRUE) - print that output to a file

43. object.size(data_set) - returns the size of the dataset in bytes
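If you want to try the export commands without touching your own folders, something like the sketch below should work. It writes to temporary files created with tempfile(), which is my assumption for the example; in the training we wrote to a normal path.

```r
# Temporary files stand in for real output paths (an assumption for this sketch)
summary_file <- tempfile(fileext = ".txt")
table_file <- tempfile(fileext = ".txt")

# Capture what summary(trees) would print, then write it under a title line
info <- capture.output(summary(trees))
cat("Summary of trees", info, file = summary_file, sep = "\n")

# Export the dataset itself as a plain text table
write.table(trees, file = table_file, quote = FALSE)

# Read both files back to check what was written
lines_back <- readLines(summary_file, warn = FALSE)
trees_back <- read.table(table_file, header = TRUE)

print(lines_back[1])
print(dim(trees_back))
print(object.size(trees))     # size of the dataset in bytes
```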

44. Then we worked on an imported dataset. The working is shown below:-

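mp = read.csv(file.choose(),header=T,check.names=FALSE)   #import the dataset as "mp"; check.names=FALSE keeps column names with spaces as they are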
#####################################
#PRINT ALL MP FROM DELHI
subset(mp, mp$State == "Delhi")


#PRINT ALL MP NAME "Kumar"
View(subset(mp,grepl(glob2rx("*kumar *"),mp$`Name of Member`)))


#PRINT THE MP WITH HIGHEST AND MINIMUM ATTENDANCE IN PARLIAMENT
a = "[^A-Z]$"
result = subset(mp,grepl(a,mp$`No. of days member signed the register`))
b = (result$'No. of days member signed the register')
a1 = as.numeric(b)
d=min(a1)
e=max(a1)
f = subset(mp,mp$`No. of days member signed the register`==e)
View(f)
g = subset(mp,mp$`No. of days member signed the register`==d)
View(g)
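Since you don't have our CSV file, here is the same idea on a tiny data frame that I made up. The names, states and day counts are entirely hypothetical, and the short column names Name, State and Days stand in for the long ones in the real file.

```r
# Hypothetical stand-in for the imported CSV; every value here is made up
mp <- data.frame(
  Name  = c("A Kumar Singh", "B Sharma", "C Kumar", "D Verma"),
  State = c("Delhi", "Bihar", "Delhi", "Goa"),
  Days  = c("45", "N.A", "60", "12"),
  stringsAsFactors = FALSE
)

# All members from Delhi
from_delhi <- subset(mp, State == "Delhi")

# All members with "kumar" in their name, ignoring upper/lower case
kumars <- subset(mp, grepl("kumar", Name, ignore.case = TRUE))

# Keep only rows where Days is purely numeric, then convert it to numbers
valid <- subset(mp, grepl("^[0-9]+$", Days))
valid$Days <- as.numeric(valid$Days)

# Members with the highest and the lowest attendance
most <- valid[valid$Days == max(valid$Days), ]
least <- valid[valid$Days == min(valid$Days), ]

print(from_delhi)
print(kumars)
print(most)
print(least)
```

Filtering out the non-numeric entries first matters, because as.numeric() turns text like "N.A" into NA and then min() and max() return NA as well.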

Today we not only worked on commands but also created a web application that was interactive in nature. We did this using the library “shiny” and 2 files with predefined file names, ui.R and server.R. We created the app’s UI in ui.R and made it interactive via server.R.

After creating the local web app, we ran it on our local host IP, 127.0.0.1.

In ui.R we wrote this code:-

fluidPage(
  titlePanel("Sliders"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("num", "integer", 1, 20, 1, step = 1,
                  animate = animationOptions(interval = 400, loop = TRUE))
    ),
    mainPanel(
      verbatimTextOutput("prod")   #text output, the match for renderPrint() in server.R
    )
  )
)

In server.R we wrote this code:-

function(input, output){
  output$prod <- renderPrint({
    x <- input$num
    a = 1
    for(i in x:1){   #multiply a by x, x-1, ..., 1
      a = a*i
    }
    cat(x, "factorial is =", a)
  })
}

And the output was like this:-

[Screenshot: the app running on a web page at the local host IP]

With these commands, a practical approach and lots of practice, the second day was over. We were informed that we would do graphics the next day, which kept up my excitement.

Training day – 3

This was the day dedicated to graphics in R. We were going to learn plotting and animation.

45. demo("graphics") - to see a demo of what you can do with graphics in R

46. apropos("plot") - to know all the commands that include plot in their name

#plot a simple pie chart
x = c(50,62,40,53)
labels = c("london","newyork","singapore","mumbai")
pie(x,labels)

#plot a colorful pie chart
y = c("red","white","green","orange")
pie(x,labels,col=y,radius = .7,main = "Popular Cities",clockwise = FALSE)

#change background color of plotting
par(bg = "yellow")

#changing it to percentage with name + percentage + "%"
piepercent=round(100*x/sum(x),1)
pie(x,labels = paste(labels,piepercent,"%"),col=y,radius = .7, main = "Popular Cities",clockwise = FALSE)

#giving legends of different part in the plot
legend("right",labels, cex=0.7,fill = y)   #fill = y uses the same colors as the pie slices
help(legend)

#plotrix for 3-d pie
library("plotrix")

x = c(1,2,3,4)
ABC = c("a","c","b","d")
pie3D(x,labels=ABC,explode=0.2,main="3D pie chart of countries")

#bar plot
h = c(7,12,28,2,41)
m = c("mar","apr","may","jun","jul")
barplot(h,names.arg = m,xlab = "month", ylab="revenue", col = "blue", main = "revenue chart",border = "red", xlim = c(0,20))

#each bar with different parameter
colors = c("green","orange","brown")
months = c("mar","apr","may","jun","jul")
regions = c("east","west","north")

values = matrix(c(2,4,5,2,5,3,7,8,9,4,7,3,2,7,5),nrow=3,ncol = 5,byrow = TRUE)

barplot(values,names.arg = months,xlab = "month",ylab = "revenue",col = colors,main = "revenue chart",border = "red")
legend("topright",regions,cex = .6,fill = colors)

summary(trees)      #will give basic info about the trees dataset
View(mtcars)

#box plot
input=mtcars[1:5,c('mpg','cyl')]
View(input)

boxplot(mpg ~ cyl,data=input,xlab= "number of cylinders", ylab= "miles per gallon", main="mileage data")

input = mtcars[,c('mpg','cyl')]
print(tail(input))

boxplot(mpg ~ cyl, data=input,xlab="number of cylinders", ylab = "miles per gallon", notch=FALSE, varwidth=TRUE,col=c("green","yellow","purple"),names=c("low","medium","high"))


#histogram
v = c(3,6,8,2,7,5,9,4,6,10,13,14,18,35,45,46,47,51,52,53,54,55,56,60,30)
hist(v,xlab = "weight",col = "yellow", border = "blue")

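#line plot
#p - points, l - line, o - both

v = c(5,8,1,10,6)
plot(v,type = "o",xaxt = "n")                     #"o" draws both points and lines; xaxt = "n" hides the default x axis
axis(1,at = 1:5,labels = c("a","b","c","d","e"))  #label the five points a to e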

#scatter plot
plot(trees)

#dot plot
dotchart(t(c(12,23,23)),color=c("red","blue","darkgreen"),main = "Dotchart for autos", cex=0.8)

#make an empty chart
plot(1,1,xlim=c(1,5.5),ylim = c(0,7),type="n",ann=FALSE)

#plot digits 0-4 with increasing size and color
text(1:5, rep(6,5), labels = c(0:4), cex = 1:5, col=1:5)

#printing symbols
points(1:5,rep(5,5),cex=1:5,col=1:5,pch=0:4)
text((1:5)+0.4,rep(5,5),cex=0.6,(0:4))

rawToChar(as.raw(59))    #to convert int to char

a=charToRaw(as.character(";"))    #to convert char to int
as.numeric(a)

intToUtf8(59)
utf8ToInt(";")

#ANIMATION
library(animation)

#https://yihui.name/animation/
#examples of animation

ani.options(interval = 0.1, nmax = 50)
par(mar = rep(0.5,4))
BM.circle(cex=1:5,pch=1:5)

par(bg="white")
par(fg = "red")
MC.hitormiss(from = 0.4, to =0.5 )$est

par(mar = c(1,1,0.2,0.2))
Rosling.bubbles()

ani.options(interval= 0.1,nmax = max(trees$Height) )
par(mar = c(1,1,0.2,0.2))
Rosling.bubbles()

# trying more
Rosling.bubbles(type = "rectangles", data = matrix(abs(rnorm(50*10*2)),ncol = 2))

ani.options(interval=0.2,nmax=25)
flip.coin(faces = c("Head","Stand","Tail"),type = "n", prob = c(0.45,0.1,0.45), col = c(9,2,4))

#  Now creating functions in R
new.function1=function(a){
  for(i in 1:a){
    b=i^2
    print(b)
  }
}
new.function1(6)   #function call

x = c(1,2,3,4,5,5,5,6,6,7,8,9)

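#function to find mode of a data
new.function2=function(a){
  best = 0                       #highest count seen so far (kept inside the function, so base max() is not shadowed)
  for(i in a){
    count = length(which(a==i))  #how many times the value i occurs in a
    if(best<count){
      best=count
      mode = i
    }
  }
  cat("mode: ",mode)
}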

new.function2(x)

result = nchar("count the number of characters")
print(result)

result = toupper("changing to upper")
print(result)

result = toupper(substring("extract",5,7))
print(result)

result = substring("extract",5,7)
print(result)
names(sort(-table(x)))[1]

View(USArrests)

states = rownames(USArrests)
substr(x = states, start = 1, stop = 3)   #substr() needs start and stop; here, the first three letters of each name

#now questions
#print all the states that contain the letter "D" or "d"
subset(states,grepl("[dD]",states))

#print all the states whose name has two words together (like "New Delhi" for a city)
subset(states,grepl(glob2rx("* *"),states))

#longest and shortest name states
states = rownames(USArrests)

long=max(nchar(states))
short=min(nchar(states))
cat("Longest names are: ",subset(states,nchar(states)==long)," \nShortest names are: ",subset(states,nchar(states)==short))
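The same kind of questions can also be answered without glob2rx(), by giving grepl() the pattern directly. This is just an alternative sketch, not how we did it in the session; startsWith() is a base R function I added for the last question.

```r
states <- rownames(USArrests)

# States whose name contains a space, i.e. two-word names
# (fixed = TRUE treats the pattern as plain text instead of a regular expression)
two_word <- states[grepl(" ", states, fixed = TRUE)]

# States containing the letter d in any case
with_d <- states[grepl("d", states, ignore.case = TRUE)]

# States whose name starts with "New"
starts_new <- states[startsWith(states, "New")]

print(two_word)
print(with_d)
print(starts_new)
```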

With this the third day was over, and we were heading towards day 4, which was going to be the last and the most exciting day of training, so I wanted to put all my effort into it. The plan was that in the first half we would do advanced “R” and in the second half we would create our project on a real dataset.

Training day – 4

>>Our first task was to learn how to make predictions on the basis of a given dataset, and we proceeded as follows:

height = c(151,174,138,186,128,136,179,163)
weight = c(63, 81, 56, 91, 47 ,57, 76, 72)

relation <- lm(weight~height)            #this will create a relation between weight and height

#predict the weight of a person with height 170
a = data.frame(height=170)
result = predict(relation,a)
print(result)
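What predict() does here can be checked by hand: the fitted line is weight = intercept + slope * height. The sketch below repeats the same data and compares the two results; the variable names b0, b1, manual and from_predict are mine.

```r
height <- c(151, 174, 138, 186, 128, 136, 179, 163)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72)

relation <- lm(weight ~ height)

# Pull the two fitted coefficients out of the model
b0 <- coef(relation)[["(Intercept)"]]   # intercept of the fitted line
b1 <- coef(relation)[["height"]]        # slope: extra kg per extra cm

# Prediction by hand for a height of 170
manual <- b0 + b1 * 170

# The same prediction through predict()
from_predict <- predict(relation, data.frame(height = 170))

print(manual)
print(from_predict)
```

Both numbers should agree, which shows that predict() for a simple linear model is nothing more than plugging the new value into the fitted equation.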

#plot of regression
plot(weight,height,col="red", main = "Height and Weight Regression", cex = 1.3, pch = 65:72,
     xlab = "Weight in kg", ylab = "Height in cm")
abline(lm(height~weight))    #draw the fitted regression line on top of the points
par(bg = "white")


#multi-predicting on a dataset; mtcars is a built-in dataset that comes with R (in the "datasets" package).
#when one field is dependent on more than one field
input = mtcars[,c("mpg","disp","hp","wt")]
View(input)
print(head(input))
model <- lm(mpg~disp+hp+wt, data = input)
print(model)

a = coef(model)[1]
Xdisp = coef(model)[2]
Xhp = coef(model)[3]
Xwt = coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)

#for a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage, using rounded coefficients, is roughly:
#Y = 37.1055 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.6659

Y = a + Xdisp*221 + Xhp*102 + Xwt*2.91
print(Y)
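Instead of multiplying the coefficients by hand, the same prediction can be obtained with predict() and a one-row data frame. This is a small sketch of that alternative; new_car, predicted and manual are names I chose.

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

# The car we want a mileage prediction for
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Prediction through predict()
predicted <- predict(model, new_car)

# The same thing by hand: intercept*1 + disp*221 + hp*102 + wt*2.91
manual <- sum(coef(model) * c(1, 221, 102, 2.91))

print(predicted)
print(manual)
```

Using predict() avoids rounding the coefficients and scales nicely when you want predictions for many cars at once: just add more rows to new_car.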

library("party")     #package for decision tree 
data()
View(readingSkills)  #reading skills comes with “party” package.

A <- ctree(nativeSpeaker~ age + shoeSize + score, data = readingSkills)
plot(A)

A = readingSkills[c(1:105),]
B <- ctree(nativeSpeaker ~ age + shoeSize + score, data = A)
plot(B)

#tutorialspoint
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,695.5,998.6,784.2,985,882.8,1071)
rainfall.timeseries <- ts(rainfall,start = c(2017,1),frequency = 12)

plot(rainfall.timeseries)
# freq:  12  then months
# freq:  4    then quarters of a year
# freq:  6    then every 10 min. of an hour
# freq:  24 * 6   then every 10 minutes of the day

rainfall2 <- c(800,832,1010,789,709,476,889,600,409,809,999,909)
combined.rainfall <- matrix(c(rainfall,rainfall2),nrow=12)

rainfall.timeseries <- ts(combined.rainfall,start = c(2017,1),frequency = 12)
plot(rainfall.timeseries)
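Once the data is a time series, R knows which month each value belongs to. Here is a minimal sketch of a few helper functions on the same rainfall numbers; window(), frequency() and month.name are base R, and the variable names are mine.

```r
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
              695.5, 998.6, 784.2, 985, 882.8, 1071)

# Monthly series starting in January 2017
rain_ts <- ts(rainfall, start = c(2017, 1), frequency = 12)

# How many observations make up one year
freq <- frequency(rain_ts)

# Cut out only January to March 2017
first_quarter <- window(rain_ts, start = c(2017, 1), end = c(2017, 3))

# Name of the month with the highest rainfall
wettest_month <- month.name[which.max(rain_ts)]

print(freq)
print(first_quarter)
print(wettest_month)
```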

And this was the end of the first half of the last day of training. There was a break for lunch, and soon the session resumed.

We were all given a dataset (a real one) to work on. But the twist was that we were not given the queries to be performed on it. The objective was to make us understand that in real projects no one gives us questions for analysis. If we work as data science professionals for a company, then our job is to query the dataset in a way that is beneficial for the company.

So now a dataset was in front of us, and I was ready to charge. My team and I completed the project with 10 questions, their code in R, their outputs and their benefits. You may see my project Here.

At the end of the session, the moment arrived that I never wanted to come. It was the end of the most wonderful 4 days of my life. Sir bade us all goodbye. We gave him our feedback verbally and thanked him a lot. I would personally like to thank him here. Sir, you were awesome! And we thank you from the bottom of our hearts.

I wish to attend more such events later in my life…

And I believe I’ll get them soon. Because “Where there’s a will, there’s a way!”…

Thank you so much for reading. I hope it has been helpful for you as well.

See you in the next blog. Till then, this is GeekyShacklebolt

bidding you goodbye!
