22 October 2011

Scrappy Scrapers

In an earlier post I presented some R code for a basic way of collecting text from websites. This is a good place to start for collecting text for use in text analysis. 


However, it clearly has some limitations:
  • You need to have all of the URLs already stored in a .csv file.
  • The method of extracting the text from the downloaded HTML code using gsub is a bit imprecise. It doesn't remove the text from common links such as "Home" or "About".
Both of these problems can be solved in R with a bit of work. But I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby.
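To make the two limitations concrete, here is a minimal Python sketch (not the R code from the earlier post) that uses only the standard library's html.parser. The names LinkCollector, VisibleTextParser, collect_links, and extract_text are made up for illustration. It assumes that dropping all link text is acceptable, which also discards links inside the body of an article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


class VisibleTextParser(HTMLParser):
    """Collect page text while skipping link, script, and style content."""

    SKIPPED_TAGS = {"a", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def collect_links(html, base_url):
    """Return every link on the page, so URLs need not be listed by hand."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def extract_text(html):
    """Return the page text without "Home"/"About"-style link labels."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Walking the parsed tags rather than pattern-matching on the raw HTML is what lets the sketch treat link text differently from body text. A dedicated parsing library would handle malformed pages more gracefully.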

ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri to scrape Pfizer's doctor payments disclosure database is particularly helpful.

Building on this, I'm thinking of putting together a slideshow on how to use Ruby, Nokogiri, and Mechanize to scrape the Congressional Record database. It will be similar to the slideshow I made on how to use the googleVis and WDI packages to make Google Motion Charts.

I'm a bit busy over the next few weeks, but now that I've blogged it, it's on my "Must-Do" list.

19 October 2011

Even More Reason To Pay Attention

A few posts back, I discussed how a number of US financial regulators and the Department of Justice seemed to be credibly committing to bad supervision.

This is especially worrying given this recent summary of how Dodd-Frank limits the powers of the Fed/Treasury/FDIC to respond to a financial crisis. Though the idea may be to limit moral hazard by credibly committing to not give 2008-style bailouts, I have a hard time believing in this credibility. My initial thought is that no democratically elected government would actually fail to respond if its economy was collapsing because of a financial crisis. So, if a major crisis hits, these Dodd-Frank provisions will merely slow down the inevitable bailouts (many of the powers can be enacted with congressional approval). There is still moral hazard feeding potential crises, but crisis responses will be slower.

As the Economist rightly points out, regulators now have an even stronger imperative to prevent a crisis. But to do this they need to know what is going on. They should not be weakening their independent supervisory power.

18 October 2011

Incredible

I was just researching the policymaking behind the Irish 2008 "Guarantee Everything" policy and found this nugget. In the one-page statement announcing the plan they cite the "international market" turmoil twice as the cause of the 2008 crisis in Ireland (US subprime induced credit crunch -> tightening liquidity markets, yada yada yada).

Not once is the massive domestic real estate bubble mentioned! Sure this doesn't reveal policymakers' total knowledge (they could just not mention the problem, while knowing it exists), but still.

Automated Academics

This WSJ piece on the US income gains over the past decade (summary: unless you have a PhD or MD, you didn't have any income gains) got me thinking:

I'm actually pretty cautious about that number. I would be more interested in the range of the distribution, since I think the percent change is being pulled up by all of those physics PhDs who went into finance.

Then again, considering that over the past few weeks I've been learning how to automate the collection of data that used to be done by people with master's degrees, maybe PhDs are going to be the ones who automate all of the former undergraduate and master's level work out of existence, keeping the productivity gains for ourselves (conditional on the tax structure). (See also Farhad Manjoo's recent series on this issue in Slate.)

One thing I gleaned from a talk given by the Governor of the California Board of Education last night was that academia largely doesn't even need PhDs (at least at all levels except the very top). You can just have a few PhDs design standardised courses and then have them administered by less trained people. This has already been happening in K-12 education, but it gets even better in higher education, as many for-profits already know.

Since adults will complete their work electronically without as much oversight as children need, you can cut out much of the administration. Monitoring instructor performance--assessing how well students are completing the standardised work--can be fully automated and done in real-time.

Conclusion: like in much of the rest of the economy, you have a few highly trained people who organise the system and everyone else just implements it with minimal need for training. The former capture most of the productivity gains.

Implication: do well in school. . . no, not just ok, but very well. Also, do well in something that allows you to design larger processes, rather than just implementing an established routine.

15 October 2011

World Bank Visualizations with googleVis

Building on the last post: I just put together a short slideshow explaining how to use R to create Google Motion Charts with World Bank data. It uses the packages googleVis and WDI. It mostly builds on the example from Mage's Blog post. I just simplified it with the WDI package (and used national finance related variables).

13 October 2011

Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. If you have a data frame of URLs it will cycle through them and grab all the websites. It strips out the HTML code. Then it saves each webpage as an individual text file.

Thanks to Rex Douglass, also.

 Enjoy (and please feel free to improve)
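For readers who want the same idea outside R, here is a minimal Python sketch of the steps described above: cycle through a list of URLs, strip the HTML, and save each page as its own text file. It is not the original R crawler. The function names (strip_html, fetch, crawl), the page_001.txt naming scheme, and the 30-second timeout are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextOnly(HTMLParser):
    """Keep text nodes; drop tags plus script and style content."""

    def __init__(self):
        super().__init__()
        self._skip = 0  # greater than 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def fetch(url):
    """Download one page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each URL's text as page_001.txt, page_002.txt, ... in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        try:
            html = fetch_page(url)
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"skipping {url}: {err}")
            continue
        path = out_dir / f"page_{i:03d}.txt"
        path.write_text(strip_html(html), encoding="utf-8")
        saved.append(path)
    return saved
```

Passing the download step in as fetch_page keeps the loop easy to try out on canned HTML before pointing it at real sites. The list of URLs could come from a column of a .csv file, as in the R version.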

Recommended -- Mid-October

Here are three articles that I've found pretty interesting over the past few days:

Finance:

A fairly insightful blog post about the changing view of management, shareholders, and corporate cash.

Journalism:

The Guardian sticks it to Murdoch, again.

Science:

This is a great article on symmetry in physics. The highly speculative ending is at the very least fun. I hadn't really known much about symmetry and larger Group Theory until reading Alexander Masters' excellent biography of the eccentric mathematician Simon Norton the other day. Also highly recommended.

9 October 2011

Real Inflation? (Part 1)

At a recent lunch the conversation turned to how most Americans' real income hasn't changed since the 1970s when we adjust for inflation (see here for some decent graphs). One of the people at the lunch (a person who has written considerably on monetary policy) contested this. His argument is that we are actually very bad at measuring inflation. Prices may rise, but the quality of the goods that we buy is much better now than it was in the seventies. The iPad I buy now is much better than the 1970s TV or radio or all the other things that it replaced in my life and probably cheaper than all of these things combined. On this line of reasoning, inflation is actually overestimated.

There is one obvious flaw with this argument: it misses much of the point. If we were really terrible at measuring inflation in this way, then yes, maybe most people's income has actually increased. But the bigger issue is that the top sliver of the income distribution has made steady gains since the 1970s even using this potentially overestimated measure of inflation. If we are underestimating the gains for most people, we are also underestimating the large gains at the top of the distribution, which reinforces the point.

Ok, but what about this idea that we overestimate inflation because we have a difficult time correcting for improvements in the goods that people buy? Maybe there is something to this, which I'll follow up on later. . .
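The arithmetic behind this can be sketched in a few lines of Python. Every number below is hypothetical and chosen only to make the sums easy; none of them is an actual income or inflation figure. The function names real_growth and relative_gain are likewise made up for the illustration.

```python
def real_growth(nominal_growth, inflation):
    """Cumulative real growth given cumulative nominal growth and inflation.

    Both arguments are fractions, e.g. 3.0 means a 300% cumulative increase.
    """
    return (1 + nominal_growth) / (1 + inflation) - 1


def relative_gain(top_nominal, median_nominal, inflation):
    """How many times larger the top group's real income multiple is."""
    top = 1 + real_growth(top_nominal, inflation)
    median = 1 + real_growth(median_nominal, inflation)
    return top / median


# Hypothetical cumulative figures since the 1970s (assumptions, not data).
measured_inflation = 3.0  # measured prices quadruple
adjusted_inflation = 2.5  # assumed lower, quality-adjusted inflation
median_nominal = 3.0      # median nominal income quadruples
top_nominal = 7.0         # top nominal income rises eightfold

for label, inflation in [("measured", measured_inflation),
                         ("quality-adjusted", adjusted_inflation)]:
    print(label,
          round(real_growth(median_nominal, inflation), 3),
          round(real_growth(top_nominal, inflation), 3),
          round(relative_gain(top_nominal, median_nominal, inflation), 3))
```

With the lower inflation figure both groups show higher real growth, but the ratio between them does not move, because the same price deflator divides both incomes and cancels out.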