TDM 10200: Project 11 — Spring 2023
Dataset(s)
'/anvil/projects/tdm/data/fars'
ONE
Go ahead and list what is in /anvil/projects/tdm/data/fars
. You do not need to read in all of the files. The fars
directory contains years including 1975-2017. Each year also contains at least 3 CSV files. The one that we will be looking at is the ACCIDENTS.CSV
-
List what files are in the year 1985
-
Read in the
ACCIDENTS.CSV
and then go ahead and change the values in theYEAR
column from two digits to four digits. For example, we should change89
to1989
. Do this by adding a19
to each year value. -
Now combine the
MONTH
,DAY
,YEAR
columns into a new column calledDATE
Helpful Hint (for b)
We can append strings to every value in a column by first converting the column to str
using astype
then use the +
operator:
myDF["myCol"].astype(str) + "appending_this_string"
-
append in coding takes an object and adds it to an existing list
Helpful Hint (for c)
If you see the numbers 99 or 9 it is an indicator that the information is unknown. If you want to learn more see here
-
Answers to the questions a,b,c above.
-
Code used to solve this problem.
-
Output from running the code.
TWO
What we want to do now is create a Dataframe called accidents
that joins the ACCIDENT.CSV
files from the years 1985-1989 (inclusive) into one large Dataframe.
Insider Knowledge
The Pandas
library has three main functions that combine data.
merge() is typically used for combining data based on common columns or indices. Merge is similar to the join function in SQL. Important to note that merge() will default to an inner join unless specified.
join() is typically used for combining data based on a key column or an index.
concat() is typically used for combining Dataframes across rows or columns.
There are several different forms of joins
we will just discuss two here.
-
inner-will return only matching rows from the tables, you will lose the rows that do not have a match in the other Dataframe’s key column.
-
outer- will return every row from both the left and right dataset. If the left dataset does not have a value for a specific row it will be left empty rather than the entire row be removed same goes for the right dataset
A great visual can be found here
-
Answers to the question above
-
Code used to solve this problem
-
Output from running the code.
THREE
Using the new accidents
Dataframe that you just created, let’s take a look at some of the data.
-
Change the values in the
YEAR
column from a 2 digit year to a 4 digit year, like we did in the last question, but using a different method. -
How many accidents are there in which one or more drunk drivers were involved in an accident with a school bus?
Helpful Hint (for a)
use the to_datetime
function
df[''] = pd.to_datetime(df[''], format='%y').dt.strftime('%Y')
Helpful Hint (for b)
look at the specifically the variables DRUNK_DR
and SCH_BUS
-
Answers to the two questions
-
Code used to solve this problem.
-
Output from running the code.
FOUR
-
Find how many accidents happen in total per year between 1 or more drunk drivers and school bus.
-
what year had the lowest number of accidents
-
what year had the most number of accidents
-
-
Now we want to consider which days of the week had the most accidents occur
-
Is there a time of day where you see more accidents? Using 12am-6am/ 6am-12pm/ 12pm-6pm/ 6pm-12am as your time frames.
-
Answers to the 3 questions above
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |