Exploring SAT Scores of New York City: Visualization & Analysis¶
Part One of this project focused entirely on preparing our data for analysis. While not the most exciting task, it is necessary to some degree for nearly every real-world data set. For the second half of this project, our goal is to understand how differences in school, borough and district relate to a school's mean SAT score and whether these differences actually have any meaningful correlation to the test scores. We will utilize several visualizations, some basic correlation analysis and hypothesis testing to help us discover trends and examine the significance of those trends.
Preliminary Analysis¶
Up to this point, we haven't set any research questions for our project other than the broad inquiry of whether we can identify any outside factors influencing SAT scores in New York City. Despite having only explored datasets in isolation, we should define some narrower research questions at this point in order to keep our analysis focused. As with all research, we may discover interesting findings outside of our intended scope along the way; however, having concrete questions that we aim to answer will ensure that we do not get lost amidst the numerous possibilities. In this notebook we will explore five potential influencing demographics: sex/gender, cultural, socioeconomic, concentration of advanced/high-achieving students, and physical location of the school.
The research questions we will attempt to answer are as follows:
- Is a school's average SAT score correlated to... :
- ...the location of a school (borough or district)?
- ...the proportion of males to females attending that school?
- ...any cultural aspects of its student population (e.g., ethnicity, English proficiency)?
- ...the socioeconomic spread of its students?
- ...the proportion of high-achieving or advanced placement students?
- Are any of these correlations strong and significant enough to warrant looking into further?
Setting the Stage with Maps¶
Just as in our first notebook, we know we are likely to use NumPy and Pandas, so we will start by importing those and setting up Pandas to display up to 500 rows or columns of a table. This time, rather than having to read in nine separate datasets, we only need to take a look at our clean data.
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
full = pd.read_csv('sat_data_clean.csv')
full.head()
To this point, we have only worked with numbers and words. However, data is often best understood and communicated through images. Since we have geodata at our disposal, we can draw up some maps to gain a 20,000-foot overview of the schools and school districts that is difficult to picture by simply reading the numbers.
For this task we will use Folium, a Python library that allows us to create interactive maps with the power of the leaflet.js JavaScript library. Our first map will be a map of New York City with a marker identifying each school. If the map is zoomed in, we can see individual schools, while zooming out displays clusters that indicate the number of schools in the area.
import folium
from folium import plugins # Needed for adding a marker cluster & heatmap
schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
marker_cluster = plugins.MarkerCluster().add_to(schools_map)
for name, row in full.iterrows():
    # add markers that display DBN and school name when clicked
    folium.Marker([row['lat'], row['lon']], popup='{0}: {1}'.format(row['DBN'], row['SCHOOL NAME'])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map
We can also create a heatmap with Folium for a different view of the concentration of schools across the city. Once again, zooming in provides a finer level of detail.
schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
schools_heatmap.add_child(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Now that we have a general idea of where schools are located throughout the city, we can take a closer look at where each district lies. To do this, we will create a map using the GeoJSON file we downloaded at the beginning of our first notebook.
One advantage that Folium provides is the ability to apply a fully-integrated choropleth layer over our map. Using our GeoJSON file to define the district boundaries within the choropleth, we can view values of any column from our data in a color spectrum on our map. However, in order to do this, we need to re-format our data to match the format of our JSON data. Opening up the GeoJSON file we see that each district is indicated by a string containing the (non-zero-padded) district number.
{"type":"Feature","properties":{"school_dist":"6"...
The following code groups our full dataset by school district, converts the school_dist column to strings, and strips the leading zeros from districts 1 through 9 so the values match the GeoJSON keys.
district_data = full.groupby('school_dist').mean().reset_index()
district_data['school_dist'] = district_data['school_dist'].apply(lambda x: str(int(x)))
district_data.to_csv('district_data.csv', index=False) # Saving to CSV for later use in Tableau
district_data
For our first school district choropleth map we will compare SAT scores among districts. Because we may later want to compare a different feature (and creating a map from scratch each time is somewhat time-consuming) we will write a function that can be reused for any column of district_data
.
# Takes any column of district_data and returns a district choropleth map with that column
def show_district_map(col):
    geo_path = 'districts.geojson'
    districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
    choropleth = folium.features.Choropleth(
        geo_data=geo_path,
        data=district_data,
        columns=['school_dist', col],
        key_on='feature.properties.school_dist',  # found in GeoJSON file
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name='Average {}'.format(col),
        highlight=True
    ).add_to(districts)
    dist_markers = district_data.iloc[:32]  # to avoid physical markers for Districts 75 and 79
    for name, row in dist_markers.iterrows():  # adds a circle marker with a hover tooltip showing the district
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            tooltip='{0}: {1}'.format('District', row['school_dist']),
            radius=2).add_to(districts)
    return districts
show_district_map('sat_score')
Already we see a noticeable difference in SAT scores from district to district. Districts 9, 12, 16, 18, 19, 23 and 32 have an average SAT score between 1085 and 1136, while districts 22, 26 and 31 average roughly 300 points higher. We will take a closer look into this below, but for now it is worth noting the strength of this map as a communication device. From one glance we can determine the lowest- and highest-scoring districts, where they lie geographically and a rough estimate of the spread between district scores. Few other visualizations provide this much easily digestible information at once, so it was well worth the time it took to fine-tune the map to our liking.
Finding Relationships in Our Data¶
With our maps as reference points, we now want to dive a little deeper into the features of our data. We'll begin by taking a look at the shape of our SAT data. Our map shows a range of 1085 to 1389, but we know the range of scores is much greater. When this data was compiled by the City of New York, the minimum score possible on the SAT was 600, while the maximum possible was 2400. As with all standardized tests, it gets increasingly harder to achieve scores toward the top of the scale, so we are unlikely to see many schools with averages on the higher end. Thus, we can expect our distribution to be skewed right to some degree. Let's take a look with a simple histogram.
Outside of maps we'll use a combination of three plotting techniques for our visualizations moving forward, depending on our needs and which method makes the quickest or cleanest plot. We will use the Pandas built-in plotting method for "quick and dirty" plots, Seaborn for clean-looking plots requiring only a little customization, and Matplotlib for anything that needs a significant amount of tweaking. The Pandas method and Seaborn are both based on Matplotlib, but each sacrifices customizability for ease of creation to some degree.
To start, we import Matplotlib and Seaborn and enable our Jupyter notebook to show the plots as output. We can also set the style of our plots so that they all look similar regardless of the plotting technique we use. For our histogram, the Pandas plotting method will suffice.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn') # All plots we make will use this style
full['sat_score'].hist(bins=20)
Just as expected, the majority of schools have average SAT scores between 1000 and 1400, giving us a right-skewed distribution. Having confirmed the distribution of SAT scores in our data, we will dig deeper into potential correlations.
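Before we do, we can put a number on that skew with Pandas' built-in skewness calculation; a value well above zero confirms the long right tail (a quick sketch using the same full DataFrame):
# Quantify the right skew we see in the histogram
print('Skewness of sat_score: {:.2f}'.format(full['sat_score'].skew()))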
Location of School¶
We have already noticed a stark difference between certain school districts' average SAT scores from our maps. To get a more detailed view of the disparity among school districts we can use boxplots. Seaborn does an excellent job at making quick and polished boxplots, so we'll use it here.
plt.style.use('seaborn')
f, ax = plt.subplots(figsize=(16, 12))
ax = sns.boxplot(data=full, x='sat_score', y='District', showmeans=True, meanline=True) # Red dotted line indicates the mean
plt.show()
As with our map, we notice right away that the median SAT score (the solid black line inside each box) is higher for districts 22, 26 and 31 than any others. We also see that the lowest scores for these districts are higher than the lowest scores for any of the other districts and that of the three, only District 31 has a high-scoring outlier school. We can say with certainty that a noticeable difference in district SAT performance exists. However, we still don't know if that difference is significant.
Before we attempt to answer that question, we will continue exploring correlations in location. Upon a closer look, we see that our three top-scoring districts are all from different boroughs: District 22 is in Brooklyn, District 26 is in Queens and District 31 is the district for all of Staten Island.
Does this mean that we won't see much of a difference in performance by borough? A boxplot by borough can help clear that up.
f, ax = plt.subplots(figsize=(16, 12))
ax = sns.boxplot(data=full, x='sat_score', y='borough', palette='muted', showmeans=True, meanline=True)
plt.show()
Interestingly, even with the three highest-scoring districts being spread across different boroughs, it appears the borough a school is located in may still be related in some way to its student body's performance on the SAT. This is likely influenced by a number of factors, such as outlier schools or the fact that Staten Island comprises only one district (District 31) with 13 schools. For the moment we can say that schools in Staten Island, Queens and Manhattan tend, on average, to have higher-scoring SAT test takers, but to gain a better understanding of the underlying factors we will have to explore the other features of our dataset.
Narrowing Down Numerical Features¶
To investigate what factors may be driving differences among boroughs and districts we will look for correlations between our sat_score
column and the other numerical columns from the full
dataset. The default correlation test in Pandas' .corr()
method is the Pearson correlation coefficient (or Pearson's r). Pearson's r is a measure of the linear correlation between two data features.
Scores of 1 or -1 indicate perfect positive or negative linear relationships respectively, while a score of 0 indicates no linear relationship. Generally speaking, r values of $\pm$ 0.6 or beyond are considered strong correlations and anything below $\pm$ 0.3 weak correlations.
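For reference, Pearson's r between two features $x$ and $y$ is their covariance divided by the product of their standard deviations:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$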
full.corr()['sat_score'].sort_values()
Looking at the top positive and negative results, we see that our AP columns all have an r-value greater than 0.5. However, they only represent the raw number of tests/test takers and not a percentage. Columns showing 1) the percentage of AP test takers per school and 2) the percentage of AP exams at each school with scores of 3 or above will help account for size differences among schools and may tell a different story. We can add these new columns using the code below.
full['pct_AP'] = full['AP Test Takers '] / full['total_enrollment']
full['pct_3_plus'] = (full['Number of Exams with scores 3 4 or 5'] / full['Total Exams Taken']).fillna(0)
full.corr()['sat_score'].sort_values().tail(20)
What if we want to know more than the r-value? A particularly useful Python library for more extensive statistical analysis is called Pingouin. Pingouin has many of the statistical tests found in SciPy and Statsmodels, but also seeks to bring to Python statistical tests that were previously available only in R or MATLAB. It is also built to work directly with Pandas DataFrames, which makes it generally more straightforward to use than SciPy. We will import it using the code below.
import pingouin as pg
Running a correlation test in Pingouin returns several other useful figures besides the r value. In particular, Pingouin returns the p-value (p-unc
) and Bayes Factor (BF10
) for each pair of features tested. Both of these values are measures of statistical significance. Generally, we reject the null hypothesis (that the two features have no correlation) when the p-value is less than 0.05. For the Bayes Factor, the larger the value, the stronger the evidence of correlation. If you need them, the confidence interval is also returned by default (CI95%
), as is Fisher's z (z
). Pingouin also makes it effortless to switch between correlation tests (Pearson's r, Spearman's rho, etc.), which we will explore later. For now, we will run the same basic correlation in Pingouin (Pearson method) and explore the DataFrame it outputs.
corr = pg.pairwise_corr(full, columns=['sat_score'])
corr.sort_values(by=['r']).reset_index()
With so many different features, it can be difficult to select which ones to analyze further. However, with our new table we see in the negative correlations that the Bayes Factor drops from around 96 to 3 after the twelfth feature (ell_num
), and this is as good a starting point as any. Below we will separate the features with the 12 strongest positive and negative r scores into their own DataFrames and combine them into a DataFrame named top_corrs
.
pos_corr = corr.sort_values(by=['r']).iloc[86:98].reset_index()
pos_corr
neg_corr = corr.sort_values(by=['r']).head(12).reset_index()
neg_corr
top_corrs = pd.concat([neg_corr, pos_corr]).reset_index().drop(columns=['level_0', 'index'])
top_corrs
To showcase the relative r-value "strengths" of these correlations, we can use a bar chart with a diverging color scheme. This can be done in Seaborn, but to harness the maximum effect of the coolwarm
colormap, we will manually tweak the colors in Matplotlib.
plt.style.use('seaborn')
xvals = range(len(top_corrs))
my_cmap = plt.cm.get_cmap('coolwarm')
colors = my_cmap([0 + (x * 0.045) for x in range(24)]) # fully saturates colors at opposing ends of our data
f, ax = plt.subplots(figsize=(12, 10))
tops = plt.bar(x=xvals, height=top_corrs['r'], color=colors)
ax.set_xticks(xvals)
ax.set_xticklabels(top_corrs['Y'], rotation=90)
ax.set_xlabel('Feature')
ax.set_ylabel('Pearson Correlation Coefficient')
plt.show()
Our bar chart excels at showing the r-values, but what do they really mean? Scatter plots are often used to visualize the relationship indicated by Pearson's r. The stronger the (linear) correlation, the more we should see an upward or downward linear trend in the data points. This will also help us identify the extent to which outliers might be affecting our r-values.
Seaborn has a handy plot (regplot
) that combines a scatter plot with a regression line of best fit, which will be perfect for this job. To avoid individually making 24 graphs from scratch, we will write a function to display a grid of 12 plots at once. So that we can reuse it later with other data and correlation tests, we will include several keyword arguments and a docstring.
def graph_corrs(data, corrs, xcols='Y', ycol='sat_score', vals='r', coef='r', orientation=None, outliers=None, nrows=4, ncols=3):
    '''
    Plots a grid of Seaborn regplots with correlation labels.

    Parameters
    ----------
    data : Pandas DataFrame
        DataFrame containing all data
    corrs : Pandas DataFrame
        DataFrame containing the correlations we want to graph
    xcols : str
        `corrs` column with `data` feature names to be plotted
    ycol : str
        `data` column name for the y-axis feature
    vals : str
        `corrs` column with correlation values
    coef : str
        Symbol for the correlation, e.g. 'r', 'rho', "r'", 'pi'
    orientation : {None, 'negative', 'positive'}
        Orientation of the correlation (sets the color of the regression line)
    outliers : list
        List of `data` column names that flag outliers. Must be in the same order as the rows of `corrs`.
    nrows : int
        Number of rows for subplots
    ncols : int
        Number of columns for subplots
    '''
    # Regression line color
    if orientation == 'negative':
        color = 'tab:red'
    elif orientation == 'positive':
        color = 'tab:green'
    else:
        color = 'tab:cyan'
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 16), sharey=True)
    for x in range(len(corrs)):
        # Scatter/outlier colors: red for flagged outliers, blue for everything else
        if coef == 'pi':
            outlier_colors = ['r' if point else 'b' for point in data[outliers[x]]]
            scatter = dict(color=outlier_colors)
        else:
            scatter = None
        g = sns.regplot(data=data,
                        y=ycol,
                        x=corrs[xcols][x],
                        color='b',
                        marker='+',
                        ax=fig.get_axes()[x],
                        label='{} = {}'.format(coef, corrs[vals][x].round(3)),
                        line_kws=dict(color=color, label='Regression Line'),
                        scatter_kws=scatter
                        )
        g.legend(loc='best')
    plt.tight_layout()
    plt.ylim(800, 2200)
Because of our docstring, we can now remind ourselves which keywords we need by typing the following in our Jupyter notebook:
?graph_corrs
Exploring Negative Correlations¶
If we take a look first at our neg_corr
plots we see that some of our features have clear downward trends, while others are blatantly influenced by outliers. One feature with a relatively clear downward linear trend is frl_percent
, the percentage of students receiving free or reduced-price lunch. We will take a closer look at this later, as it is one of our only socioeconomic indicators and appears to have a strong correlation to SAT score.
graph_corrs(full, neg_corr, orientation='negative')
full[['SCHOOL NAME', 'sat_score', 'borough', 'frl_percent']][full['frl_percent'] > 85]
Even those without a clear linear trend can reveal interesting information, however. Both ell_percent
and is_intl
indicate that native English speakers have a distinct advantage on the SAT: all "International" schools and schools with over 40 percent "English language learners" had mean SAT scores under 1250. Some were even among the lowest in the city, with mean scores in the 900s. Of these 28 schools, there are 10 in Manhattan, 10 in the Bronx, 4 in Queens and 4 in Brooklyn. None are in Staten Island.
full[['SCHOOL NAME', 'sat_score', 'borough']][full['ell_percent'] > 40]
full[['SCHOOL NAME', 'sat_score', 'borough']][full['is_intl'] == 1]
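One way to cross-check the borough counts quoted above is to combine both filters and tally schools by borough (a small sketch; each school is counted once even if it meets both criteria):
# Boroughs of schools that are International or have over 40 percent English Language Learners
full[(full['ell_percent'] > 40) | (full['is_intl'] == 1)]['borough'].value_counts()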
For hispanic_per
and black_per
, we see a slight downward linear trend, but there are a number of outlier schools with less than 20 percent Black or Hispanic students that may make the negative correlation appear stronger than it is. However, it is worth noting that if we look at the schools that have over 80 percent Black or Hispanic students we find some interesting insights:
- None of these schools are in Staten Island.
- Schools with a high percentage of Hispanic students are primarily in Manhattan or the Bronx. Half of these schools are International.
- Schools with a high percentage of Black students are overwhelmingly in Brooklyn (50 of 57), with the remainder located in Queens.
- Primarily Black schools have slightly higher SAT scores on average (1147) than primarily Hispanic schools (1085).
full[['SCHOOL NAME', 'sat_score', 'borough']][full['hispanic_per'] > 80]
full[['SCHOOL NAME', 'sat_score', 'borough']][full['black_per'] > 80]
One final aspect that stands out from our neg_corr
grid involves students with special needs. We see that the percentage of Special Ed students appears to have some negative correlation with SAT scores, but it is easy to miss the two other features that align with this finding: Local - % of grads
and Local - % of cohort
. These features don't refer to the locale of students, but rather to the type of diploma they receive.
In NYC, high school students can receive one of three different diplomas: Advanced Regents, Regents, or Local. Local diplomas are only available to qualifying students: those with Individualized Education Plans or disabilities.
The takeaway from these features, however, is not that special needs students do worse on the SAT. Rather, when we look at the schools that have more than 70 percent of students graduating with Local diplomas or that have 25 percent of students in a Special Ed program, we see two main things:
- The Bronx and Brooklyn dominate these lists again, comprising 32 of the 45 schools. (Staten Island had one school that met these criteria); and
- Even though their scores are on the lower end, none of these 45 schools had averages in the sub-1000 range — despite their higher-than-average proportion of special needs students.
full[['SCHOOL NAME', 'sat_score', 'borough', 'District']][full['Local - % of grads'] > 70]
full[['SCHOOL NAME', 'sat_score', 'borough', 'District']][full['sped_percent'] > 25]
With our initial exploration of the negative correlations done, we can begin outlining our discoveries:
- Locations (District and Borough) appear to be related to average score, but likely due to various other factors.
- Gender variables did not receive top r-values, but should be looked into.
- Cultural aspects seem to play a part. Percentages of both Black and Hispanic students may have negative correlations to SAT score, but English proficiency seems to have the more noticeable connection at the extreme ends.
- Poorer students (those who receive free or reduced-price lunch) have the most consistent correlation to low scores.
- Schools with a greater percentage of special needs students have lower, but not the lowest, scores.
Exploring Positive Correlations¶
Moving on to our positive correlations, we see that our original AP-related features show very little linear trend. On the other hand, the AP features we created in this notebook — pct_3_plus
and pct_AP
— show clearer upward trends, but the strength of their r-values may be somewhat overinflated by outliers.
graph_corrs(full, pos_corr, orientation='positive')
If we take a closer look at the schools with either a) more than 20 percent AP test takers; or b) more than 70 percent 3+ AP scores, we notice the following:
- A handful of schools had extremely high proportions of test takers scoring 3 and above, but had very low average SAT scores.
- These are schools with high percentages of English Language Learners, suggesting that either AP tests don't require the same command of the English language, or that students at these schools took AP tests in math/science rather than reading/writing.
- These schools are more evenly distributed among the boroughs (except Staten Island) than the schools we examined through our neg_corr exploration.
- Again, there is only one school from Staten Island, but this time it is the borough's only — and significant — outlier: Staten Island Technical High School.
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'pct_3_plus']][full['pct_3_plus'] > 0.7].sort_values(by='sat_score')
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'pct_AP']][full['pct_AP'] > 0.2].sort_values(by='sat_score')
Although our AP features weren't as promising as we'd hoped, we did see some correlation. However, another feature related to high-achieving students may shed some light on the matter. With a quick glance at our regplot grid, we see that all the schools flagged as "Specialized" have average SAT scores upwards of 1700. All except The Brooklyn Latin School also hold top spots in our AP features.
What is special about these schools? A Google search reveals that eight of the nine "Specialized" schools in our dataset are among the top 10 high schools in New York State (top 100 in the U.S.) and require eighth graders to achieve a certain rank on the Specialized High School Admissions Test prior to admission.
The ninth school on this list, Fiorello H. Laguardia High School of Music & Arts and Performing Arts, is also elite, though more emphasis is placed on their auditions than academics. Incoming students to this school only have to show "evidence of satisfactory achievement" which is defined as 75 or higher in core subjects and a 90 percent attendance rate. Emphasis on the Arts over academics may explain why this school has the lowest mean SAT score of the nine, but 1707 is still one of the highest in our dataset.
It is quite possible that this small group of elite schools is inflating our r-values for pct_AP
and pct_3_plus
.
full[['SCHOOL NAME', 'sat_score', 'borough', 'District']][full['is_specialized'] == 1].sort_values(by='sat_score')
Moving on to the cultural demographics that appear in pos_corr
, we see that SAT scores trend upward in relation to larger percentages of White and Asian students in a school. However, neither plot is particularly linear and asian_per
shows a number of highly-Asian schools in the mid-to-lower SAT score range. Upon closer inspection, we find the lower-scoring schools are once again International schools.
Schools with over 30 percent White students, on the other hand, had average SAT scores that bottomed out at 1195.
Interestingly, when comparing these ethnic demographic variables to our highly Hispanic or Black schools, we find some remarkable contrast in both location and saturation:
- Our cutoff for examining "highly" Hispanic and Black schools was 80 percent. For White and Asian student proportions it is 30 percent. There are only 3 schools in our data that are above 80 percent White or Asian.
- The majority of highly Black or Hispanic schools were located in Brooklyn, the Bronx or Manhattan. No schools in Staten Island had over 80 percent Black or Hispanic students.
- There are more highly Asian schools in Queens than the other boroughs, though Brooklyn and Manhattan are well-represented. Staten Island and the Bronx only have one school each in this category.
- Highly White schools are more evenly spread among the boroughs — except the Bronx, where there is only one school with more than 30 percent White students. However, seven of Staten Island's 13 high schools fall into this category — more than any feature we've examined so far.
- A 50+ percent Black or Hispanic student body and a 30+ percent White or Asian student body are mutually exclusive.
# School info where 30+ percent Asian
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'asian_per']][full['asian_per'] > 30].sort_values(by='borough')
# School info where 30+ percent White
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'white_per']][full['white_per'] > 30].sort_values(by='borough')
# Number of schools with 50+ percent Black or Hisp., 30+/50+ percent White or Asian
print('Schools with 50+ percent Black or Hispanic students: {}'.format(len(full[full['black_per'] > 50]) + len(full[full['hispanic_per'] > 50])))
print('Schools with 30+ percent White or Asian students: {}'.format(len(full[full['white_per'] > 30]) + len(full[full['asian_per'] > 30])))
print('Schools with 50+ percent White or Asian students: {}'.format(len(full[full['white_per'] > 50]) + len(full[full['asian_per'] > 50])))
# Max percent of other races by type
race = ['White', 'Asian', 'Black', 'Hispanic']
feat = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
race_per = [30, 30, 50, 50]
print('Max. racial percentages for schools with... \n')
for x in range(len(race)):
    print('{}+ percent {} students:'.format(race_per[x], race[x]))
    print(full[[f for f in feat if f != feat[x]]][full[feat[x]] >= race_per[x]].max())
    print('\n')
The last standout from our pos_corr
regplots is the apparent positive correlation between the percentage of "Advanced Regents" graduates and SAT scores. Of all the plots, the Advanced Regents features had the clearest linear relationship to SAT performance. As briefly discussed earlier, students in New York high schools can earn Local, Regents or Advanced Regents diplomas. It is somewhat intuitive that Advanced Regents graduates would perform better on the SAT, as it has the strictest requirements of the three diploma types.
Schools with a high proportion of Advanced Regents graduates (40-plus percent) included all nine of our Specialized schools and were found in all five boroughs. However, only two schools each from the Bronx and Staten Island fit this classification — three of which were Specialized schools: High School of American Studies at Lehman College, Bronx High School of Science and Staten Island Technical High School.
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'Advanced Regents - % of cohort']][full['Advanced Regents - % of cohort'] > 40].sort_values(by='borough')
full[['SCHOOL NAME', 'sat_score', 'borough', 'District', 'Advanced Regents - % of grads']][full['Advanced Regents - % of grads'] > 40].sort_values(by='borough')
Exploring the Male-Female Split¶
To this point we haven't seen any information relating to a school's male to female ratio. It is possible that there is no correlation between SAT scores and gender. After all, male_per
and female_per
received r-values of -0.096 and 0.097 respectively. As we will discuss in more detail below, though, Pearson's r is only accurate when certain assumptions are met, and it only accounts for linear relationships. To make sure we don't rule out a gender correlation too early, we should at least take a look at the scatter plots for each.
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.scatterplot(data=full, x='male_per', y='sat_score', color='tab:blue', ax=ax1)
sns.scatterplot(data=full, x='female_per', y='sat_score', color='tab:pink', ax=ax2)
plt.tight_layout()
The scatter plots indicate virtually no linear correlation, which lines up with the r-values they received from our Pearson analysis. However, we do see that all the higher-scoring schools have a male-to-female split between roughly 30/70 and 70/30. This is almost certainly because the vast majority of schools fall into that range. Still, we can rule out a curvilinear relationship (in this case an upside-down U) by plotting these features with a regplot and adding the keyword argument order=2
. Changing the order
argument to a number greater than 1 prompts regplot
to call np.polyfit
under the hood and estimate a polynomial regression on our data. If these features did have a curvilinear relationship, the regression line would show it here.
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.regplot(data=full, x='male_per', y='sat_score', order=2, ax=ax1, color='tab:blue', line_kws=dict(color='k'))
sns.regplot(data=full, x='female_per', y='sat_score', order=2, ax=ax2, color='tab:pink', line_kws=dict(color='k'))
plt.tight_layout()
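For the curious, the sketch below approximates what regplot does under the hood when order=2: fit a quadratic with np.polyfit and evaluate it across the observed range with np.polyval (this assumes male_per has no missing values in our cleaned data).
# Fit and plot a second-degree polynomial for male_per vs. sat_score
coeffs = np.polyfit(full['male_per'], full['sat_score'], deg=2)  # quadratic coefficients
xs = np.linspace(full['male_per'].min(), full['male_per'].max(), 100)
plt.scatter(full['male_per'], full['sat_score'], marker='+')
plt.plot(xs, np.polyval(coeffs, xs), color='k')  # should closely match the curve regplot draws
plt.xlabel('male_per')
plt.ylabel('sat_score')
plt.show()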
Having completed our exploratory analysis, we can answer the first part of our initial inquiry:
- Both District and Borough appear to be related to average SAT score, influenced by factors such as cultural and socioeconomic makeup.
- There is likely no correlation between gender and SAT performance.
- Cultural demographics are correlated to SAT performance, though in differing degrees:
- Higher percentages of both Black and Hispanic students tend to correlate with lower SAT scores, while larger percentages of White and Asian students correlate with higher SAT scores. However, the threshold for classifying schools as heavily Black/Hispanic is different from that for White/Asian students: many schools have well over 50 percent Black or Hispanic students (324 of 478), while the number of schools with 50+ percent White or Asian students is minimal (25 of 478; 79 schools are 30+ percent White or Asian).
- Schools with 30+ percent White or Asian students and schools with 50+ percent Black or Hispanic students are mutually exclusive.
- Poor English proficiency seems to outweigh ethnic correlations — schools with high proportions of English Language Learners accounted for the lowest SAT scores regardless of racial split.
- Poorer students and neighborhoods have the most consistent correlation to low scores. Percent of students receiving free or reduced-price lunches had a clear negative correlation with SAT scores, and the Bronx — New York's poorest borough — had the most low-scoring schools in the city.
- The percentage of high-performing/advanced students is the best indicator of top SAT scores, and while schools with a greater percentage of special needs students tended to have lower scores, they were not as low as scores from International schools.
Strength and Significance of Correlations¶
Correlation to Numerical Features¶
To wrap up this project, we want to know the strength of the correlations we found and whether they are statistically significant. Earlier in this notebook we touched briefly on Pearson's Product-Moment Correlation and mentioned that it is unlikely to be accurate for our data. That is because two of the Pearson test's main assumptions are 1) bivariate normality; and 2) homoscedasticity of the data. This is a fancy way of saying that the two variables being compared need to 1) follow a normal, bell-curve distribution; and 2) have roughly constant variance across their range of values.
We already know that our main variable, sat_score
, has a skew-right distribution, so we automatically break that assumption. What about our other variables though? Is sat_score
close enough to a normal distribution to use Pearson's r if our other variable is normally distributed? The following examples help illustrate the issue.
sns.jointplot(data=full, y='sat_score', x='sped_percent', kind='reg', color='r', marginal_kws=dict(color='k'))
sns.jointplot(data=full, y='sat_score', x='white_per', kind='reg', color='b', marginal_kws=dict(color='k'))
Above, we have joint plots representing two of our variables, sped_percent
and white_per
, plotted against sat_score
. The histograms on the side and top represent the shape of each variable. As we can see, sped_percent
is much closer to a normal distribution than white_per
, which is heavily skewed right. However, in both plots we see evidence of heteroscedasticity, whose telltale sign is data that fans out in a cone shape rather than a tube shape. That fanning, together with the outliers visible in both plots, makes it highly unlikely that Pearson's r will accurately capture the strength of either correlation.
Although this probably holds true for each of the features we are interested in, Pingouin makes it relatively painless to double check. Using pg.homoscedasticity
and pg.multivariate_normality
we can simply loop through our variables and print out the results:
neg_feats = ['frl_percent', 'Local - % of grads', 'sped_percent', 'ell_percent', 'hispanic_per', 'black_per']
pos_feats = ['pct_3_plus', 'asian_per', 'pct_AP', 'white_per', 'Advanced Regents - % of grads', 'Advanced Regents - % of cohort']
for x in neg_feats:
    print(x)
    print(pg.homoscedasticity(full[['sat_score', x]]), pg.multivariate_normality(full[['sat_score', x]]))
    print('\n')
for x in pos_feats:
    print(x)
    print(pg.homoscedasticity(full[['sat_score', x]]), pg.multivariate_normality(full[['sat_score', x]]))
    print('\n')
Unsurprisingly, all of our features fail the tests for bivariate normality and homoscedasticity. This is not uncommon with real-world data, which is rarely a perfect fit for theoretical statistical models out of the box. Fortunately, there are still ways to analyze correlation strength and significance within our dataset. Two of the most common are to use non-parametric statistical tests (tests that make fewer assumptions about the structure/distribution of the data) or to transform the data using a power transform, such as a Box-Cox transformation, to stabilize variance and bring the distribution closer to normal.
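As a quick illustration of the transform route (which we will not pursue here), a Box-Cox transform of sat_score might look like the sketch below, using scipy.stats.boxcox; Box-Cox requires strictly positive values, which SAT scores satisfy.
from scipy import stats

# Fit the Box-Cox transform (lambda is estimated by maximum likelihood)
transformed, fitted_lambda = stats.boxcox(full['sat_score'].dropna())
print('Fitted lambda: {:.3f}'.format(fitted_lambda))

# Compare skewness before and after the transform
print('Skew before: {:.2f}'.format(full['sat_score'].skew()))
print('Skew after: {:.2f}'.format(pd.Series(transformed).skew()))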
For our purposes, using a non-parametric test to analyze our numeric features will be more straightforward than using power transforms. Since we know that several (if not all) of our features have outliers, we want a test that — unlike Pearson's r — is robust to such outliers. A relatively new test called Shepherd's pi, which is essentially a Spearman correlation computed after bivariate outliers (flagged via a bootstrapped Mahalanobis distance) have been removed, has been shown to be more robust to those outliers than Pearson's r, Spearman's rho and the skipped correlation (r'). Once again, we can easily conduct this test on our data using Pingouin's pairwise_corr
function.
shep_corr_neg = pg.pairwise_corr(full, columns=[['sat_score'], neg_feats], method='shepherd')
shep_corr_pos = pg.pairwise_corr(full, columns=[['sat_score'], pos_feats], method='shepherd')
all_shep_corrs = pd.concat([shep_corr_neg, shep_corr_pos]).sort_values(by='r').reset_index().drop(columns=['index'])
all_shep_corrs['p_doubled'] = all_shep_corrs['p-unc'].apply(lambda x: 1 if x*2 > 1 else x*2) # p-val x 2 to account for removing outliers
all_shep_corrs[['X', 'Y', 'outliers', 'r', 'CI95%', 'p-unc', 'p_doubled']]
As we can see, the Shepherd's pi correlation identified a number of outliers in each of the features we analyzed. However, even after removing those outliers from the analysis, all 12 of our features have a statistically significant correlation (p_doubled < 0.05), including those with the weakest correlation (black_per
, hispanic_per
and pct_AP
).
To visualize how the Shepherd's pi correlation flags outliers, we can take a look at the Pingouin source code and use the shepherd
function to create columns in our full
dataset that correspond to the outliers. The code below does the following:
- Imports the underlying shepherd function from Pingouin
- Loops through each of our top correlating features, passing each through shepherd() and saving the list of outliers as a new column in the full dataset
- Creates a list of the outlier column names
- Plots each regplot using our graph_corrs function, where the list of outlier column names is used to create a custom color map for the scatter plots: red indicates outliers, blue for everything else.
from pingouin.correlation import shepherd
outlier_columns = []
for feat in all_shep_corrs['Y']:
    shep_r, shep_pval, outliers = shepherd(full[feat], full['sat_score'])
    full['{}_outliers'.format(feat)] = outliers
    outlier_columns.append('{}_outliers'.format(feat))
# First 10 rows of outlier info
full[outlier_columns].head(10)
graph_corrs(full, all_shep_corrs, coef='pi', outliers=outlier_columns)
Correlation to District and Borough¶
We began this notebook by using maps and boxplots to view correlations between sat_score
and both District
and borough
, so it is only fitting to end by looking at the significance of those correlations. However, boxplots and maps alone can't indicate statistical significance. Because our District
and borough
columns are categorical rather than numerical, we cannot use the same Shepherd's pi method we used for our other features. Instead we will need to do an analysis of variance (ANOVA) test, which determines whether the differences between the means of three or more groups are statistically significant.
Although the traditional ANOVA (or F-test) is robust to violations of the normality assumption, much like Pearson's correlation it is unreliable when the groups of data have unequal variances. Fortunately, another ANOVA — Welch's ANOVA — is not sensitive to unequal variances and actually tends to be more accurate than the traditional ANOVA in most circumstances.
For our analysis, we will use Welch's ANOVA to test whether any districts or boroughs differ significantly in mean SAT score, followed by the Games-Howell post-hoc test to determine which pair(s) are significantly different from one another.
Our null hypothesis is that there are no significant differences between the groups' means: $H_0 : \mu_1 = \mu_2 = \mu_3 \cdots \mu_k$, where $\mu$ = group mean and $k$ = number of groups.
Our alternative hypothesis is that there is a significant difference between group means: $H_a : \mu_i \neq \mu_j$, where $\mu_i$ and $\mu_j$ can be the mean of any group.
If there is at least one group with a significant difference from another group (p < 0.05), we will reject our null hypothesis.
Using Pingouin's welch_anova()
function to analyze both districts and boroughs, we see that our p-values are less than 0.05 and we can thus reject the null hypothesis for both.
pg.welch_anova(dv='sat_score', between='District', data=full)
pg.welch_anova(dv='sat_score', between='borough', data=full)
When we run our post-hoc Games-Howell test on pairs of districts, we see that our p-values indicate significance mostly between pairs containing District 02. We also see that the effect size (hedges
) is very large, even for those district pairs that were not found to be significantly different. For example, District 16 vs. District 31 received a p-value of 0.118, even though it has an effect size of 1.9 (indicating roughly 79.4% non-overlap between the two distributions).
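For context, Hedges' g is essentially Cohen's d (the difference between two group means divided by their pooled standard deviation) multiplied by a small-sample correction factor, so a value of 1.9 means the two district means sit nearly two pooled standard deviations apart:
$$g = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)$$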
So how can we explain this pair of districts not meeting statistical significance? For most district pairs, we simply do not have enough schools per district to be sure. If we take a look at the number of schools in each district, we see that District 02 has the most (65), District 10 comes in second (28), and 29 of our 34 districts have fewer than 20 schools. With group sizes this small, it is important to remember that a p-value greater than 0.05 doesn't necessarily mean there is no real difference — it means that we cannot confidently reject our null hypothesis with the data we have.
district_gh = pg.pairwise_gameshowell(data=full, dv='sat_score', between='District').sort_values(by='pval')
district_gh[district_gh['pval'] < 0.15] # Top 20 of over 500 different pairs
full['DBN'].groupby(full['District']).count()
We are, however, in luck when it comes to our boroughs. The results of the Games-Howell on borough
match up to what we see in our boxplots (re-illustrated below):
- Bronx and Brooklyn are both significantly different from Manhattan, Queens and Staten Island.
- Bronx and Brooklyn are not significantly different from each other.
- There is no significant difference between any combination of Manhattan, Queens and Staten Island.
boro_gh = pg.pairwise_gameshowell(data=full, dv='sat_score', between='borough').sort_values(by='pval')
boro_gh
f, ax = plt.subplots(figsize=(16, 12))
ax = sns.boxplot(data=full, x='sat_score', y='borough', palette='muted', showmeans=True, meanline=True)
plt.show()
Results¶
Despite the fact that this data was inherently difficult to fit into existing statistical models, our analysis uncovered some excellent insights into possible contributing factors for SAT performance across New York City. Circling back to the inquiries that motivated this project, we find that significant correlations to SAT score were found in nearly every demographic we set out to explore.
- Cultural and racial makeup mattered: the percentages of White and Asian students showed a strong positive relationship to SAT scores, while the percentages of Black and Hispanic students showed a weaker negative relationship. The percentage of English Language Learners also had a moderate negative correlation with SAT scores.
- The proportion of high-achieving students also had a strong positive correlation to SAT score, though more so when looking at the percentage of students who achieved an Advanced Regents diploma than at those who took or excelled on AP tests. Conversely, we also saw a moderately strong negative correlation to the percentage of both Special Ed students and Local diploma earners.
- For socioeconomic demographics, we saw a strong negative correlation to the percentage of students receiving free or reduced-price lunch. The city's poorest borough, the Bronx, also had the lowest mean and median SAT scores in our dataset.
- The Bronx and Brooklyn had significantly different (lower) means from the other three boroughs, with medium to large effect sizes for each pair. Districts, on the other hand, showed very large effect sizes, but their group sizes were mostly too small to confidently call the differences in means significant.
Only a school's gender split proved to be virtually unrelated to its SAT score, which makes sense as it is the least intertwined of all the features we examined.
Limitations¶
Although this project paints a compelling narrative for what lies behind the scenes of a school's average SAT score, there are numerous reasons why it should not be relied on for prediction purposes and why it should be reassessed in the future by other analysts:
- First, this data is nearly a decade old now. NYC Open Data hasn't released newer data on SAT results, so there is no way to know whether these schools are still performing at the level they did in 2012.
- Second, the underlying data was compiled from several different sources, which did not all contain the same schools. Rather than drop all our missing values, we imputed values that were missing for schools that were in our
sat_results
data. In doing so, we preserved our sample size at the cost of some accuracy.
- We also had no ideal way to measure statistical significance due to the highly varied sample sizes, data shapes and standard deviations. Although the Shepherd's pi analysis was more accurate than Pearson's r, a careful look shows that the outliers it excluded were virtually all of the International schools — a highly informative, yet small, portion of our data.
- Finally, it would be nice to include more socioeconomic demographics. Of all the features in our data, only
frl_percent
directly addressed student body affluence (or lack thereof). It is entirely possible that wealth plays a larger part than this analysis reveals, as students with means often have an advantage when it comes to test preparation.