bivariate analysis is part of multivariate analysis, defined as the observation and analysis of more than one outcome variable
reminder, data is classified as either categorical or numerical.
in bivariate data, we analyze the distributions of two data columns instead of just on.
this can be used to find correlations between two variables
we put the explanatory variable in the x axis and the response variable on the y axis
this is an attempt to see the correlation between x and y such that a change in x might cause a change in y
if both the explanatory variable and the response variable are categorical, we need to use tables to represent their relationships
for example: consider the following data:
| basketball size | basketball color |
|---|---|
| small | red |
| big | red |
| small | blue |
| big | blue |
| small | blue |
| big | red |
| big | red |
we need to represent this data in a table that represents the
| big | small | |
|---|---|---|
| red | 3/4 | 1/3 |
| blue | 1/4 | 2/3 |
| total | 1 | 1 |
this way, we find that the size of a basketball can relate to the basketball color, because a 3/4 of big basketballs are red, much larger compared to small basketballs, 1/3
if both the explanatory variable and the response variable are numerical, we need to use a scatterplot to represent their relationships
consider the following data:
| books read | test score |
|---|---|
| 11 | 30 |
| 9 | 90 |
| 6 | 40 |
| 12 | 80 |
| 3 | 60 |
| 5 | 40 |
yeah, scatter plots work
the calculators can calculate the linear regression line, also called the line of best fit
this line is useful to make predictions of one of the variables when you have another
if that wasn’t clear, for example, for a person who has read 6 books, you can kinda predict that they will get a test score of about 50
if the calculator solved for the equation of the linear regression line, we just substitute and solve for the prediction y value
predictions are either interpolation, if the prediction input is within the range of your data, and extrapolation, if the prediction input is outside the range of your data
by predicting , it is within the range, which is more accurate than predicting , which is outside the range
the Pearson’s correlation constant, depicted as , shows you how strongly the correlation is, at a scale of 0 to 1, from 0 meaning no association at all, to 1 meaning perfectly straight linear association
heres another table about and association strength because why not
| || | association |
|---|---|
| 0-0.24 | no association |
| 0.25-0.49 | weak association |
| 0.5-0.74 | moderate association |
| 0.75-0.99 | strong association |
| 1 | perfect association |
for example, in the dataset above, the calculator says that , meaning that it’s a weak association
additionally, if the correlation is negative, meaning that the slope of the regression line is negative, the value is negative as well