PART A

1)

Develop an R program to quickly explore a given dataset, including categorical analysis using the group_by() command, and visualize the findings using ggplot2.

In this program, we will:

  1. Load the required libraries and dataset.

  2. Explore the structure of the dataset.

  3. Convert a numerical variable into a categorical variable.

  4. Perform categorical analysis using group_by() and summarize().

  5. Visualize the results using ggplot2.

Step 1: Load required libraries and dataset

  • tidyverse is a collection of packages for data science.

  • dplyr is used for grouping and summarizing data.

{r} library(tidyverse)
library(dplyr)
data <- mtcars

Step 2: Explore the dataset

Before performing any analysis, we should understand the dataset.

We will check:

  • Number of rows and columns

  • Column names

  • Data types

  • Summary statistics

  • First few rows

{r} # Dimensions (rows and columns)
dim(data)
# Column names
names(data)
# Structure of the dataset
str(data)
# Summary statistics
summary(data)
# First six rows
head(data)

Step 3: Convert numeric variable to categorical

The variable cyl represents the number of cylinders in a car.

Although it is numeric (4, 6, 8), it represents categories. For categorical analysis, we convert it into a factor.

{r} # Convert 'cyl' to a factor
data$cyl <- as.factor(data$cyl)

# Confirm the conversion
str(data$cyl)
levels(data$cyl)
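
To see why the conversion matters, the following minimal sketch compares how R summarizes cyl before and after as.factor(). The names cyl_numeric and cyl_factor are introduced here for illustration only and are not part of the program above.

```r
# Illustration only: cyl_numeric and cyl_factor are names introduced for this sketch
cyl_numeric <- mtcars$cyl
cyl_factor  <- as.factor(mtcars$cyl)

# A numeric vector is summarized with quantiles and a mean
summary(cyl_numeric)

# A factor is summarized with one count per level
summary(cyl_factor)

# The factor has one level per distinct cylinder count
nlevels(cyl_factor)
levels(cyl_factor)
```

Treating cyl as a factor is what lets group_by() and the fill aesthetic in ggplot2 handle it as a set of discrete categories rather than a continuous scale.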

Step 4: Perform categorical analysis

We calculate the average miles per gallon (mpg) for each cylinder category.

How these functions work together:

  • %>% passes the output of one function to the next.

  • group_by(cyl) splits the dataset into groups.

  • summarize() calculates statistics per group.

  • mean(mpg) computes the average mileage.

  • .groups = "drop" removes the grouping afterward.

{r} summary_data <- data %>%
  group_by(cyl) %>%
  summarize(
    avg_mpg = mean(mpg),
    .groups = "drop"
  )
summary_data
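
The same group_by() and summarize() pattern can return several statistics at once. The sketch below is one possible extension, not part of the original program; the names cars, summary_stats, n_cars, and sd_mpg are assumptions chosen for illustration.

```r
library(dplyr)

# Assumption: same built-in dataset and factor conversion as in the steps above
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Several statistics per cylinder group in one summarize() call
summary_stats <- cars %>%
  group_by(cyl) %>%
  summarize(
    n_cars  = n(),        # number of cars in the group
    avg_mpg = mean(mpg),  # average mileage
    sd_mpg  = sd(mpg),    # spread of mileage within the group
    .groups = "drop"
  )

summary_stats
```

Reporting the group size next to the mean helps judge how much weight each average deserves.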

Step 5: Visualize using a bar plot

{r} ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average MPG by cylinder count",
    x = "Number of cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()
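
As a side note, ggplot2 also provides geom_col(), which draws bars whose heights are taken directly from the data and is therefore equivalent to geom_bar(stat = "identity"). A minimal sketch, where summary_tbl and p_col are names introduced only for this example:

```r
library(ggplot2)
library(dplyr)

# Rebuild the per-cylinder averages so the sketch is self-contained
summary_tbl <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() uses the y values as bar heights, like geom_bar(stat = "identity")
p_col <- ggplot(summary_tbl, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  labs(
    title = "Average MPG by cylinder count",
    x = "Number of cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()

p_col
```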

2)

---
title: "prg2"
author: "kiran"
format: docx
editor: visual
---

Write an R script to create a scatter plot, incorporating categorical analysis through colour-coded data points representing different groups, using ggplot2.

Step 1: Load libraries

We load two libraries:

  • ggplot2 is used to build plots layer by layer (we will use it to create the scatter plot).

  • dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).

{r} library(ggplot2)
library(dplyr)

Step 2: Load the dataset (iris)

We use the built-in dataset iris.

What this dataset contains:

  • Each row is one flower sample (an observation).

  • There are 150 observations in total.

  • The column Species is a categorical variable with 3 groups:

    • setosa

    • versicolor

    • virginica

  • The columns Sepal.Length and Sepal.Width are numeric measurements that we will plot.

{r} data = iris
head(data, 10)
{r} tail(data)
{r} str(data)
{r} summary(data)
{r} names(data)
{r} dim(data)
{r} data[1]
{r} data$Sepal.Length
{r} typeof(data$Sepal.Length)
{r} typeof(data[1])
{r} data[][3]
{r} data[150,5]
{r} data[1:5,1:3]
{r} data[][5]
{r} data$Species
{r} table(data$Species)
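
The exploration above uses base R only. Since dplyr is loaded as well, a per-species summary could be sketched as follows; species_summary and its column names are hypothetical, chosen for this example.

```r
library(dplyr)

# species_summary is a name introduced for this sketch
species_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    n           = n(),                 # flowers per species
    mean_length = mean(Sepal.Length),  # average sepal length
    mean_width  = mean(Sepal.Width),   # average sepal width
    .groups = "drop"
  )

species_summary
```

Comparing the group means gives a first hint of how the species might separate in the scatter plot.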

Step 5: Create a basic scatter plot (no categories yet)

A scatter plot shows the relationship between two numerical variables.

Here we plot:

  • x-axis: Sepal.Length

  • y-axis: Sepal.Width

Important point:

  • Each dot represents one flower (one row in the dataset).

{r} ggplot(data , aes(x = Sepal.Length,y = Sepal.Width)) + geom_point()

Step 6: Add a categorical grouping using color = Species

Now we include the categorical variable:

  • color = Species tells ggplot to assign a different color to each species.

What changes?

  • The plot now visually separates the three species by color.

  • This is the main "categorical analysis" idea: we can see whether different groups cluster differently.

{r} ggplot(data , aes(x = Sepal.Length,y = Sepal.Width, color = Species)) + geom_point()

Step 7: Improve point visibility (size and transparency)

We adjust how the points look:

  • size = 3 makes each dot bigger, so it is easier to see.

  • alpha = 0.7 makes each dot slightly transparent, which helps when points overlap.

Why transparency helps:

  • If many points overlap in the same region, transparency makes dense areas more visible.

{r} ggplot(data , aes(x = Sepal.Length,y = Sepal.Width, color = Species)) + geom_point(size = 3,alpha = 0.7)
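
To check whether overlap actually occurs in this dataset, one option is to count how many flowers share the same pair of measurements. This is a sketch under the assumption that an exact match on both values counts as an overlap; overlaps and n_points are names introduced here.

```r
library(dplyr)

# Count how many flowers share each (Sepal.Length, Sepal.Width) pair
overlaps <- iris %>%
  count(Sepal.Length, Sepal.Width, name = "n_points") %>%
  filter(n_points > 1)

# Number of plot positions occupied by more than one flower
nrow(overlaps)

# The most crowded positions first
head(arrange(overlaps, desc(n_points)))
```

Positions listed here are drawn as stacked dots, which is where alpha below 1 makes the darker, denser spots stand out.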

Step 8: Add informative labels (title, axes, legend)

Good plots should clearly communicate what the viewer is seeing.

labs() adds:

  • title for the plot heading

  • x and y axis labels

  • color legend title (so the legend has a meaningful name)

{r} ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter plot of sepal dimensions",
    x = "Sepal length",
    y = "Sepal width",
    color = "Species"
  )

Step 9: Apply a clean theme and move the legend

Themes control the background, grid lines, and text styling.

  • theme_minimal() removes the heavy background and gives a clean look.

  • theme(legend.position = "top") moves the legend above the plot.

Why move the legend?

  • When the legend is at the top, it is often easier to notice and read, especially in presentations.

{r} ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter plot of sepal dimensions",
    x = "Sepal length",
    y = "Sepal width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")


4)

---
title: "prg4"
author: "Kiran T L"
format: docx
editor: visual
---

Problem statement: Develop an R script to produce a bar graph displaying the frequency distribution of categorical data, grouped by a specific variable, using ggplot2.


Step 1: Load required libraries

{r} # install.packages('ggplot2')
library(ggplot2)

Step 2: Load and inspect the dataset

We load the built-in dataset mtcars and view the first few rows to understand its structure.

{r} data = mtcars
head(data)

Step 3: Exploratory data analysis

Before creating any visualization, we explore the dataset to understand the variables and their types.

{r} str(data)
{r} summary(data)
  • str(data) helps us to identify the data type of each variable

  • summary(data) provides statistical summaries

Step 4: Convert numeric variables to factors

To visualize categorical data correctly, we convert the relevant variables into factors.

{r} data$cyl
{r} table(data$cyl)
{r} data$gear
{r} table(data$gear)
{r} class(data$cyl)
{r} class(data$gear)
{r} data$cyl = as.factor(data$cyl)
data$gear = as.factor(data$gear)
{r} class(data$cyl)
{r} data$cyl
{r} class(data$gear)
{r} data$gear
{r} summary(data)
{r} str(data)

Step 5: Examine the frequency distribution

Before plotting, we analyze how the data is distributed across categories.

{r} table(data$cyl)
{r} table(data$gear)
{r} table(data$cyl ,data$gear )
  • Helps us understand the count of each category

  • Provides insight into relationships between variables

  • Prepares us for interpreting the visualization
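
The two-way table holds the same counts that geom_bar() computes for each combination of cyl and gear. As a minimal sketch, the table can be converted to a long data frame with one row per combination; cars and freq are names introduced for this example.

```r
# Assumption: same conversion of cyl and gear to factors as in Step 4
cars <- mtcars
cars$cyl  <- as.factor(cars$cyl)
cars$gear <- as.factor(cars$gear)

# Long-format counts: one row per (cyl, gear) combination
freq <- as.data.frame(table(cyl = cars$cyl, gear = cars$gear))
freq

# The counts add up to the number of cars in the dataset
sum(freq$Freq)
```

Combinations that never occur, such as 8 cylinders with 4 gears, appear here with a count of 0 and have no bar in the chart.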

Step 6: Create the bar graph

{r} ggplot(data, aes(x=cyl, fill=gear ))
{r} ggplot(data, aes(x=cyl, fill=gear ))+geom_bar()
{r} ggplot(data, aes(x=cyl, fill=gear ))+geom_bar(position="dodge")
{r} ggplot(data, aes(x=cyl, fill=gear ))+geom_bar(position="dodge")+ theme_minimal()+ theme(legend.position = 'top')
{r} ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Bar graph displaying the frequency distribution of categorical data",
    x = "Number of cylinders",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = 'top')

5)

# Load library
library(ggplot2)

# Use built-in dataset
data <- iris

# Convert Species to factor (grouping variable)
data$Species <- as.factor(data$Species)

# Create histogram with density curves
ggplot(data, aes(x = Sepal.Length, fill = Species, color = Species)) +
  
  # Histogram (density scaled)
  geom_histogram(aes(y = after_stat(density)),
                 position = "identity",
                 alpha = 0.4,
                 bins = 30) +
  
  # Density curves for each group
  geom_density(alpha = 0.8, linewidth = 1) +
  
  # Labels
  labs(
    title = "Distribution of Sepal Length with Density Curves by Species",
    x = "Sepal Length",
    y = "Density"
  ) +
  
  # Theme
  theme_minimal()
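
The histogram above is scaled with after_stat(density) so that it shares a y-axis with the density curves: on that scale, the bars of each group cover a total area of 1. The base R sketch below illustrates the idea for the whole Sepal.Length column. Note that hist() picks its own bin edges, so the bars will not match the ggplot2 bins exactly; h and area are names used only here.

```r
# Base R check of what "density" scaling means; h and area are names used only here
h <- hist(iris$Sepal.Length, breaks = 30, plot = FALSE)

# Area of the histogram = sum of (bar height x bar width)
area <- sum(h$density * diff(h$breaks))
area
```

With count scaling the bars would be far taller than the density curves, which is why the two layers are put on the same density scale.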


6)

---
title: "prg6"
author: "Kiran T L"
format: docx
editor: visual
---

Write an R script to construct a box plot showcasing the distribution of a continuous variable, grouped by a categorical variable, using ggplot2's fill aesthetic.

Step 1: Load Required Library

{r} # Load ggplot2 package for visualization
library(ggplot2)

Step 2: Explore the Built-in Dataset

{r} # Use the built-in 'iris' dataset
# 'Petal.Width' is a continuous variable
# 'Species' is a categorical grouping variable
str(iris)
head(iris)

Step 3: Construct Box plot with Grouping

Step 3.1:

{r} # Initialize ggplot with data and aesthetic mappings
p = ggplot(data = iris, aes(x = Species, y = Petal.Width, fill = Species))
p
{r} # Add the box plot layer
p = p + geom_boxplot()
p
{r} # Apply a clean theme, move the legend to the top, and add labels
# (the box plot layer is already part of p, so it is not added again)
p = p + theme_minimal() +
  theme(legend.position = 'top') +
  labs(title = "IRIS box plot", x = 'Species', y = 'Petal.Width')
p
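
To read the box plot, it helps to look at the numbers behind it. In geom_boxplot(), the box spans the first to the third quartile with a line at the median, and the whiskers reach to the most extreme points within 1.5 times the interquartile range. The base R sketch below computes the quartiles per species; quartiles is a name introduced for this example, and its 0% and 100% rows are the raw minimum and maximum, which are not necessarily the whisker ends.

```r
# Quantiles of Petal.Width for each species (base R only)
quartiles <- sapply(split(iris$Petal.Width, iris$Species), quantile)
quartiles

# Interquartile range per species = height of each box
quartiles["75%", ] - quartiles["25%", ]
```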


7)

---
title: "prg7"
author: "Kiran T L"
format: docx
editor: visual
---

Program

Develop a function in R to plot a function curve based on a mathematical equation provided as input, with a different curve style for each group, using ggplot2.

Objectives

Step 1: Load Required Library

{r} # Load ggplot2 package for advanced plotting
library(ggplot2)

We use the ggplot2 package because it allows elegant and flexible plotting. It supports layering and grouping very well.

Step 2: Create Data for the Functions

{r} # Create a sequence of x values ranging from -2*pi to 2*pi
x <- seq(-2 * pi, 2 * pi, length.out = 500)
# Evaluate sin(x) and cos(x) over the x range
y1 <- sin(x)
y2 <- cos(x)
# Combine the data into one data frame
df <- data.frame(
  x = rep(x, 2),
  y = c(y1, y2),
  group = rep(c("sin(x)", "cos(x)"), each = length(x))
)

Step 3.1: Initialize the ggplot Object

{r} # Start building the ggplot using the data frame and aesthetics
p = ggplot(df, aes(x = x, y = y, color = group, linetype = group))
p

Step 3.2: Add the line Geometry

{r} # Add smooth lines to represent each function curve
# (linewidth replaces the deprecated size argument for lines in ggplot2 3.4.0 and later)
p = p + geom_line(linewidth = 1.2)
p

Step 3.3: Add Title, Axis Labels, and Legend

{r} # Add title, axis labels, and legends
p <- p + labs(
  title = "Function Curves: sin(x) and cos(x)",
  x = "x",
  y = "y = f(x)",
  color = "Function",
  linetype = "Function"
)
p

Step 3.4: Apply a Clean Theme

{r} # Use a clean and simple background theme
p <- p + theme_minimal()
p
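
The steps above build the plot interactively. Since the problem statement asks for a function, one way to wrap the same steps could look like the sketch below. plot_function_curves() is a hypothetical helper written for this example, not part of ggplot2; it assumes the equations are supplied as a named list of R functions and that ggplot2 3.4.0 or later is installed (for linewidth).

```r
library(ggplot2)

# plot_function_curves() is a hypothetical helper, not part of ggplot2.
# funs: named list of functions of one numeric argument
# from, to, n: range of x values and number of evaluation points
plot_function_curves <- function(funs, from = -2 * pi, to = 2 * pi, n = 500) {
  x <- seq(from, to, length.out = n)

  # Evaluate every function on the same grid and stack the results
  pieces <- lapply(names(funs), function(name) {
    data.frame(x = x, y = funs[[name]](x), group = name)
  })
  df <- do.call(rbind, pieces)

  # One color and one line type per function
  ggplot(df, aes(x = x, y = y, color = group, linetype = group)) +
    geom_line(linewidth = 1.2) +
    labs(x = "x", y = "y = f(x)", color = "Function", linetype = "Function") +
    theme_minimal()
}

# Usage: pass the equations as a named list
p_curves <- plot_function_curves(list("sin(x)" = sin, "cos(x)" = cos))
p_curves
```

Any function of x can be added to the list, for example "x^2 / 10" = function(x) x^2 / 10, and it receives its own color and line type automatically.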

