STAT 29000: Project 1 — Spring 2021
Motivation: Extensible Markup Language, or XML, is an important file format for storing structured data. Even though formats like JSON and CSV tend to be more prevalent, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and CSV are themselves losing ground as newer formats and serialization methods like Parquet and protobufs become more common.
Context: In previous semesters we’ve explored XML. In this project we will refresh our skills and, rather than exploring XML in R, we will use the lxml package in Python. This is the first project in a series of 5 projects focused on web scraping in R and Python.
Scope: python, XML
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/apple/health/watch_dump.xml
Resources
We realize that for many of you this is a big "jump" right into Python. Don’t worry! Python is a very intuitive language with a clean syntax. It is easy to read and write. We will do our very best to keep things as straightforward as possible, especially in the early learning stages of the class.
We will be actively updating the examples book with videos and more examples throughout the semester. Ask a question in Piazza and perhaps we will add an example straight to the book to help out.
Some potentially useful resources for the semester include:
- The STAT 19000 projects. We are easing 19000 students into Python and will post solutions each week. It would be well worth 10 minutes to look over the questions and solutions each week.
- Here is a decent cheat sheet that helps you quickly see how to do in Python something you already know how to do in R.
- The Examples Book, updated daily with more examples and videos. Be sure to click on the "relevant topics" links, as we try to point you to topics with examples that should be particularly useful for solving the problems we assign.
Questions
It would be well worth your time to read through the XML section of the book, as well as take the time to work through |
Question 1
A good first step when working with XML is to get an idea of how your document is structured. Normally, there should be good documentation that spells this out for you, but it is good to know what to do when you don’t have the documentation. Start by finding the "root" node. What is the name of the root node of the provided dataset?
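As a minimal sketch of what finding a root node looks like, the snippet below parses a small, made-up XML string (not the course dataset) and prints the tag of its root element. It assumes lxml is installed and falls back to the standard library's xml.etree.ElementTree, which shares the calls used here. The library and book names are hypothetical.

```python
try:
    from lxml import etree  # the package this project uses
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Hypothetical toy document; the real dataset has a different structure.
toy_xml = """<library>
  <book id="1"><title>Dune</title></book>
  <book id="2"><title>Emma</title></book>
</library>"""

# fromstring() parses the text and returns the root element directly.
root = etree.fromstring(toy_xml)
print(root.tag)
```

For a file on disk, `etree.parse(path).getroot()` should give you the same kind of element to start from.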
Make sure to import the
|
Here are two videos about running Python in RStudio…
…and here is a video about XML scraping in Python:
- Python code used to solve the problem.
- Output from running your code.
Question 2
Remember, XML can be nested. In question (1) we figured out what the root node was called. What are the names of the next "tier" of elements?
Now that we know the root node, you could use the root node name as a part of your XPath expression.
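Here is one possible sketch of listing the second "tier" of tags on a made-up document (the library, meta, and book names are hypothetical). When lxml is available it uses an absolute XPath expression that starts with the root node's name. If only the standard library is available, it simply iterates over the root's children, since those elements have no xpath method.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical toy document, not the course dataset.
toy_xml = """<library>
  <meta><updated>2021-01-01</updated></meta>
  <book id="1"><title>Dune</title></book>
  <book id="2"><title>Emma</title></book>
</library>"""

root = etree.fromstring(toy_xml)

if hasattr(root, "xpath"):
    # lxml only: an absolute XPath that begins with the root node's name.
    children = root.xpath("/library/*")
else:
    # Standard-library elements have no .xpath(); iterate over children instead.
    children = list(root)

# dict.fromkeys drops repeated tags while keeping first-seen order.
second_tier = list(dict.fromkeys(child.tag for child in children))
print(second_tier)
```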
As you may have noticed in question (1) the |
- Python code used to solve the problem.
- Output from running your code.
Question 3
Continue to explore each "tier" of data until there isn’t any left. Name the "full paths" of all of the "last tier" tags.
Let’s say a "last tier" tag is just a path where there are no more nested elements. For example,
|
Here are 3 of the 7 "full paths":
|
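One way to explore every tier at once is a small recursive walk. The sketch below is only an illustration on a made-up document: leaf_paths is a hypothetical helper (not part of lxml), and it treats any element with no child elements as a "last tier" tag.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def leaf_paths(element, prefix=""):
    """Return the set of tag paths ending at elements with no children."""
    path = f"{prefix}/{element.tag}"
    if len(element) == 0:  # len() of an element counts its child elements
        return {path}
    paths = set()
    for child in element:
        paths |= leaf_paths(child, path)
    return paths


# Hypothetical toy document, not the course dataset.
toy_xml = """<library>
  <meta><updated>2021-01-01</updated></meta>
  <book id="1"><title>Dune</title><author>Herbert</author></book>
  <book id="2"><title>Emma</title></book>
</library>"""

root = etree.fromstring(toy_xml)
for path in sorted(leaf_paths(root)):
    print(path)
```

Collecting the paths in a set means a tag that repeats thousands of times is reported once.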
- Python code used to solve the problem.
- Output from running your code.
Question 4
At this point you may be asking yourself, "but where is the data?" Depending on the structure of the XML file, the data could either be between tags like:
<some_tag>mydata</some_tag>
Or, it could be in an attribute:
<question answer="tac">What is cat spelled backwards?</question>
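To make the distinction concrete, the short sketch below parses the question element shown above and reads both places where data can live: the text between the tags and the attribute. It assumes lxml is installed, with a fallback to the standard library's ElementTree, which exposes the same text, attrib, and get members.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

question = etree.fromstring(
    '<question answer="tac">What is cat spelled backwards?</question>'
)

# Data stored between the opening and closing tags.
print(question.text)

# Data stored in an attribute, read two equivalent ways.
print(question.attrib["answer"])
print(question.get("answer"))
```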
Collect the "ActivitySummary" data, and convert the list of dicts to a pandas DataFrame. The following is an example of converting a list of dicts to a pandas DataFrame called myDF:
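import pandas as pd
# Each dict is one row; its keys become the column names.
list_of_dicts = []
list_of_dicts.append({'columnA': 1, 'columnB': 2})
# Key order within a dict does not matter; values are matched by key.
list_of_dicts.append({'columnB': 4, 'columnA': 1})
myDF = pd.DataFrame(list_of_dicts)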
It is important to note that an element’s "attrib" attribute looks and feels like a
|
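As a rough sketch of turning attributes into a list of dicts, the snippet below copies each element's attrib into a plain dict. The log and entry names and their values are made up for illustration. It assumes lxml is installed, with a fallback to the standard library's ElementTree.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical toy document, not the course dataset.
toy_xml = """<log>
  <entry day="mon" steps="4000"/>
  <entry day="tue" steps="6500"/>
</log>"""

root = etree.fromstring(toy_xml)

# iter("entry") visits every <entry> element in the document;
# dict(...) copies the attribute mapping into a plain dict.
list_of_dicts = [dict(entry.attrib) for entry in root.iter("entry")]
print(list_of_dicts)
```

The resulting list can be passed to pd.DataFrame in the same way as the list_of_dicts example above. Note that attribute values are always strings, so numeric columns need converting before you do arithmetic or sorting on them.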
- Python code used to solve the problem.
- Output from running your code.
Question 5
pandas is a Python package that provides the DataFrame and Series classes. A DataFrame is very similar to a data.frame in R and makes it easy to manipulate the data it holds. A Series is the class that handles a single column of a DataFrame. Go through the pandas in 10 minutes page from the official documentation. Sort, find, and print the top 5 rows of data based on the "activeEnergyBurned" column.
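The sketch below shows one way to sort a DataFrame and keep the top 5 rows, using made-up data (the day and steps columns and their values are hypothetical). Because values read from XML attributes are strings, it converts the column to numbers first; otherwise the sort would be alphabetical and "800" would rank above "12000".

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up.
myDF = pd.DataFrame({
    "day": ["mon", "tue", "wed", "thu", "fri", "sat", "sun"],
    "steps": ["4000", "6500", "12000", "800", "9100", "3000", "7000"],
})

# Convert the string column to numbers so the sort is numeric.
myDF["steps"] = pd.to_numeric(myDF["steps"])

# Largest values first, then keep the first 5 rows.
top5 = myDF.sort_values("steps", ascending=False).head(5)
print(top5)
```

`myDF.nlargest(5, "steps")` is a shorter alternative that should select the same rows.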
- Python code used to solve the problem.
- Output from running your code.