Computational Journalism, Spring 2016

Agenda

Final Project

Final requirements and deadlines

  • README.md due Monday, May 30
  • Last class on June 1 to work on projects and get feedback
  • Deployed project due Tuesday, June 7

If you need help with deployment, ask me before Tuesday.

Here’s a sample app:

Last week

How weird is your household? In-class critique, pitting the New York Times data team against a former Apple designer.

Data and some lessons here: congress-data-taster

Here’s something you should be able to do with the APIs: Find the least partisan congressmembers by voting record.
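Here's a minimal sketch of one approach, assuming you've already pulled each member's votes into a CSV. The filename and columns (member, party, position, party_position) are placeholders for whatever the API actually returns:

  import pandas as pd

  # Hypothetical file: one row per member per vote, with the member's vote
  # and their party's majority position on that vote
  df = pd.read_csv('votes.csv')

  # A member is partisan to the degree they vote with their party's majority
  df['votes_with_party'] = df['position'] == df['party_position']
  loyalty = df.groupby('member')['votes_with_party'].mean()

  # The least partisan members have the lowest party-loyalty rate
  print(loyalty.sort_values().head(10))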

May 16: Congressional data

In class, consider how money spent is a proxy for competitive races.

Explore the FEC Expenditures with Python and answer these questions (a starter sketch follows the list):

  • How much independent expenditure money was spent on Trump vs opposing Trump? How about Clinton?
  • Which candidates had the most oppo money spent against them?
  • How much money was spent on Facebook advertising in 2014 vs 2010?
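Here is a starter sketch for the first two questions. The filename and columns (cand_name, sup_opp, exp_amt) are placeholders; check the headers and data dictionary of the actual FEC file you download:

  import pandas as pd

  # Placeholder filename and column names; consult the FEC data dictionary
  df = pd.read_csv('independent_expenditures.csv')

  # Money spent supporting ('S') vs. opposing ('O') each candidate
  for name in ['TRUMP', 'CLINTON']:
      cand = df[df['cand_name'].str.contains(name, case=False, na=False)]
      print(name)
      print(cand.groupby('sup_opp')['exp_amt'].sum())

  # Candidates with the most oppo money spent against them
  oppo = df[df['sup_opp'] == 'O']
  print(oppo.groupby('cand_name')['exp_amt'].sum()
            .sort_values(ascending=False).head(10))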

Then, apply the same process to candidate disbursement data. Pay particular attention to how different its fields are from the fields for independent expenditures.

For Wednesday: We’ll build an app in class. Sign up for the following APIs:

Homework

And before class (i.e. Tuesday night), email me 10 interesting ideas for mashups of Congress data.

Last week, May 9

Homework (due next Monday and Wednesday): Mini-Project: A Flask App That Filters

Topic: The work of ranking and filtering data

Compare how the different organizations rank and filter data, and weigh the pros and cons:

College ranking sites

College Scorecard

US News Best Colleges


Last time, May 9

Deploying a web application on Heroku

Lessons

Homework due Wednesday, May 11

Readings

Deploy your own Flask news app

Send me two deliverables:

  • a URL to your live app on Heroku
  • a new GitHub repo (do not put it in cj-2016) named myfirstnewsapp that contains the code to your Flask app

Build a Flask app similar to the news app described in NICAR’s First News App tutorial. Deploy it to Heroku.

Warning: the First News App tutorial contains a number of unnecessary steps that you don’t need to follow. It also does not contain any of the steps needed to deploy to Heroku (see the note after the list below).

  • You should be able to skip most of the installation instructions.
  • Skip the instructions involving virtualenv and/or git – we don’t need the former, and you already know how to do the latter from the other tutorials.
  • Skip Act 5: Hello Internet, because it describes an alternate way of deploying to the Internet.
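For reference, getting onto Heroku mostly comes down to two extra files at the root of your repo. This is a minimal sketch, assuming your Flask instance is named app and lives in app.py:

Procfile (tells Heroku how to start your web process):

  web: gunicorn app:app

requirements.txt (tells Heroku which packages to pip install):

  Flask
  gunicorn

Then create the Heroku app and git push heroku master.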

You don’t have to use the LA Riots data, but I want you to be able to create an app that has at least these components:

Note: while we haven’t covered JavaScript explicitly, you should be able to create an interactive JavaScript map by following the tutorial and making adjustments as needed.

FWIW, here’s my hot take on the First News App. The main change is that I use the Google Street View API to display a picture of the incident address.

A note about including external files

When you get to the Hello, JavaScript section of the First News App tutorial, it will have some example HTML for including the external JavaScript and style files for its Leaflet interactive maps:

  <link rel="stylesheet" href="http://cdn.leafletjs.com/leaflet/v0.7.7/leaflet.css" />
  <script src="http://cdn.leafletjs.com/leaflet/v0.7.7/leaflet.js"></script>

Don’t copy that code – it will be non-functional on a Heroku site, because Heroku serves pages over https, and browsers such as Chrome will not load files from insecure http URLs into an https page.

Instead, include this HTML snippet – it will import the same Leaflet code, but will actually work on Heroku:

  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.7/leaflet.css" />
  <script src="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.7/leaflet.js"></script>

Last week, May 4

Continuing with data wrangling and visualization using pandas and matplotlib

Relevant repos:

  • tweepy-congress-collector - a repo containing code to fetch data from the Twitter API and use pandas to join and analyze the data. It also contains all of the actual data fetched (300MB).
  • python-notebooks-data-wrangling - this repo contains several notebooks and several hundred MB of data. You can follow the lessons on how to fetch the data yourself, or clone the repo and access the data locally. Some lessons:

Last week, May 2

Visualization with matplotlib; data wrangling with pandas

Clone the repo here: https://github.com/datademofun/matplotlibsampler and work your way through the lessons.

Homework (due May 4): "3-charts"

In your cj-2016 repo, create a folder named 3-charts

It should contain three charts:

  • A line chart comparing the stock performance of various tech companies (see the data/stocks directory in the matplotlibsampler repo)
  • A scatterplot showing the relationship between two variables. Check out the data/schools or data/congress directory. The easiest example: are high SAT reading scores related to high SAT math scores? (duh) You don’t have to join two different datasets.
  • A stacked bar chart with categorical variables (any of the data files will work, but data/congress might be easiest). An example: number of Twitter followers by congressional party and gender

Your cj-2016/3-charts folder should contain saved images of the charts (i.e. .png files) and the code to generate those files.
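If you're not sure where to start on the line chart, here's a minimal sketch. The column names (date, close) are a guess; inspect the actual CSV headers in data/stocks first:

  import pandas as pd
  import matplotlib.pyplot as plt

  # Plot each company's closing price over time on one set of axes
  fig, ax = plt.subplots()
  for ticker in ['AAPL', 'GOOG', 'FB']:
      stock = pd.read_csv('data/stocks/%s.csv' % ticker)
      ax.plot(pd.to_datetime(stock['date']), stock['close'], label=ticker)
  ax.legend()
  ax.set_title('Tech stock performance')
  fig.savefig('3-charts/line-chart.png')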

Readings

Also, please read:


Last Week, April 27

Pre-built Flask apps to fork, clone, and improve

Worth reading on your own: What I Learned About the Washington Post From Four Years Collecting Data on Police Violence

Monday, April 25

Wednesday, April 20: Texas Web Scraping

Previous class

Monday, April 18: guest speaker

David Yanofsky of Quartz will talk about his entrepreneurial work in data visualization and investigations.

Homework due April 20: Scrape and Count Webpages

We're going to pivot to web scraping and HTML parsing. The first lesson is here:

Collect the lists of White House press briefings
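The basic fetch-and-count pattern looks like this. The URL and the link filter below are illustrative, so adapt them to the actual briefing index pages:

  import requests
  from bs4 import BeautifulSoup

  # Illustrative URL; the real briefings list spans many paginated pages
  url = 'https://www.whitehouse.gov/briefing-room/press-briefings'
  soup = BeautifulSoup(requests.get(url).text, 'html.parser')

  # Count the links that appear to point to individual briefings
  links = [a['href'] for a in soup.find_all('a', href=True)
           if 'press-briefing' in a['href']]
  print(len(links), 'briefing links found')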

In-class exercise on building a Flask app

Finish this series of exercises and make sure you can produce a simple Flask app:

Introduction to Simple Web Applications with Flask
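If you can get this running, you have a simple Flask app. Save it as app.py and run python app.py:

  from flask import Flask
  app = Flask(__name__)

  @app.route('/')
  def homepage():
      return '<h1>Hello, world!</h1>'

  if __name__ == '__main__':
      app.run(debug=True)  # serves at http://localhost:5000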

We'll tentatively start on this series of Flask app-building lessons:

Introduction to Building Web Applications from Data

Last Week

  • April 4 Lecture and Homework
  • Dollars for Docs prehistory
Two Weeks Ago

Practice examining NYPD Stop and Frisk Data using interactive Python.

COMM 177A/277A

Focuses on using data and algorithms to lower the cost of discovering stories or telling stories in more engaging and personalized ways. Project-based assignments drawn from real-world challenges faced in newsrooms. Prior experience in journalism or computational thinking is helpful. Prerequisite: COMM 273D, COMM 113/213, or the consent of the instructor.

Instructor

Dan Nguyen, dun@stanford.edu

Meeting times

Office hours

  • Mondays and Wednesdays, 1 PM to 3 PM, or by appointment
  • McClatchy Hall 342

Objectives

Grading

  • Attendance: 10%
  • Homework: 50%
  • Projects: 40%
  • There is no final exam.
  • Please let me know several days in advance if you cannot make class.
  • There is a final project that will consist of a public-facing web application. Here's a nice example from a student last year.
  • There will be 2 smaller projects, some of which will be worked on in-class and in groups.
  • There will be readings/case studies every week.
  • There will be challenges every week.

Books and Resources

There are no required books, but I'll likely make frequent references to:

We'll be using Python 3.5 and GitHub. You should be using a text editor for writing your programs: either Sublime Text 3 (3, not 2) or Atom will do.

Syllabus

Week 1

March 28

Homework

Due on Monday, via today's lesson plan:

Practice examining NYPD Stop and Frisk Data using interactive Python.
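The kind of quick, interactive tallying you'll practice looks something like this; the filename and the pct (precinct) column are placeholders for the actual stop-and-frisk extract:

  import csv
  from collections import Counter

  # Placeholder filename/column; the real CSV has dozens of columns
  with open('sqf-2015.csv') as f:
      rows = list(csv.DictReader(f))

  print(len(rows), 'stops')
  print(Counter(row['pct'] for row in rows).most_common(5))  # busiest precincts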

Week 2 - Text and Visualizing Text

April 4 / April 6

April 4 Lecture and Homework

We'll continue reviewing Python programming fundamentals in the service of deserializing text into data structures and, when necessary, serializing data structures into text files, particularly CSV and JSON.
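For example, round-tripping between raw text and data structures:

  import csv
  import json

  # Deserialize: JSON-formatted text into a Python data structure
  text = '[{"name": "Ava", "votes": 3}, {"name": "Bo", "votes": 5}]'
  records = json.loads(text)

  # Serialize: the same data structure out to a CSV file
  with open('votes.csv', 'w', newline='') as f:
      writer = csv.DictWriter(f, fieldnames=['name', 'votes'])
      writer.writeheader()
      writer.writerows(records)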

Week 3 - Filtering noise / Web scraping

April 11 / April 13

Our problem is not lack of information. It's lack of attention span. Data is not much good to us if we can't sort it the way we need it to be sorted. Hence, the need to scrape webpages and PDFs.

We'll use ProPublica's Dollars for Docs as a case study.

By now, we'll have written a fair amount of HTML. Web scraping generally involves learning one more kind of text parser, such as lxml or BeautifulSoup, and writing the automated logic to navigate a website.
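With BeautifulSoup, for instance, the parsing itself is only a few lines; the automated navigation logic is where the real work goes:

  from bs4 import BeautifulSoup

  html = '<ul><li><a href="/a.html">Doc A</a></li><li><a href="/b.html">Doc B</a></li></ul>'
  soup = BeautifulSoup(html, 'html.parser')
  for link in soup.find_all('a'):
      print(link['href'], link.text)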

Week 4 - APIs

April 18 / April 20

David Yanofsky of Quartz will talk about his entrepreneurial work in data visualization and investigations.

Building a better Recalls site: studying the Recalls dataset.

A walkthrough of HTML scraping and regexes

Introduction to Simple News Apps based on CPSC Recall Data

Homework: Build out the Recalls app, going as far as rendering the data as a table and adding product images.
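Rendering the recall records as a table follows a pattern like this minimal sketch; the RECALLS records and their fields are placeholders for however you end up structuring the CPSC data:

  from flask import Flask, render_template_string

  app = Flask(__name__)

  # Placeholder records; the real app would load the CPSC recalls dataset
  RECALLS = [{'product': 'Widget', 'hazard': 'Fire',
              'image_url': 'http://example.com/widget.jpg'}]

  TEMPLATE = """
  <table>
    {% for r in recalls %}
    <tr>
      <td>{{ r.product }}</td>
      <td>{{ r.hazard }}</td>
      <td><img src="{{ r.image_url }}" width="100"></td>
    </tr>
    {% endfor %}
  </table>
  """

  @app.route('/')
  def index():
      return render_template_string(TEMPLATE, recalls=RECALLS)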

Week 5 - Intermediate Flask App construction

April 25 / April 27

Building multi-page Flask apps; examples:

Week 6 - Data Visualization

May 2 / May 4

Studying both the technique and theory of effective data visualization, and how to use Python's matplotlib to efficiently produce charts.

Readings



Week 7 - News application critiques, Application Deployment

May 9 / May 11

Compare and contrast examples of real-world news applications and data portals, including ProPublica's Represent and Socrata.

Learn how to deploy a basic app to Heroku (and maybe AWS).

Steps:


Week 8 - Congressional and other Public Data

May 16 / May 18

Study APIs and datasets focused on U.S. Congress, including:

Readings

The Itemizer (thescoop.org) by Derek Willis:

Why he made it:

There’s one thing that has always bugged me about how we reference campaign finance data online: the best that most of us can do when we link to a campaign filing is to link to a particular page, whether that’s a list of contributors or a summary page. Yet often we’re referencing a single transaction or line-item.

via Derek Willis: The Data-Driven Congressional Reporter (thescoop.org)

Maybe you don’t have time to read the Record every day; wouldn’t it be great if you could set some simple rules for things of interest and have a computer do it for you? Wouldn’t it make sense that a computer could find the exception to the rule among a series of House votes that occurred while you were out interviewing people?

Here are some screenshots from the NYT's internal Congress app that give an idea of the "views" into Congressional voting data that are of interest to New York Times political reporters:

[Three screenshots of the NYT's internal Congress app]


Week 9 - News application critique and deployment (continued)

May 23 / May 25

Week 10 - Project work week

May 30 / June 1

In-class time to work on and share projects.