
A guide to SCOTUS Search

Guide version 1.0 — February 18, 2015

Last Wednesday I posted an intro note to SCOTUS Search: the free, searchable online database of United States Supreme Court oral argument transcripts that Victoria Kwan and I just launched in beta. The post recounted the development of the idea behind SCOTUS Search, as well as some plans for the project going forward.

Now that the site has seen some traffic (which is extremely exciting!), I figured it would be worthwhile to put together a short guide with some tips on how to best use the site, some caution about its exhaustiveness, and various other marginalia. This post is likely to be updated over time as more things come to mind.

Before I say anything else, though: thank you so much for checking it out! This is a project Victoria and I have been working on, on and off, for the better part of a year now, so it’s really gratifying to see people making use of the site and tweeting out their favorite search results and obscure judicial references. I can’t wait to see what legal writers, academics, journalists, and Court-watchers do with this data going forward.

So, in no particular order:

  • The first thing I must emphasize again, as I did in the intro post (and as is displayed prominently on the SCOTUS Search home page), is that SCOTUS Search is still in beta. What does this mean in practice? A lot of things, actually:
    1. The database of oral argument transcripts is neither exhaustive nor 100% error-free. I don’t mean this to be alarming in any way, but just as a fair warning. As Oyez notes, the Supreme Court only “installed an audio recording system in 1955.” (You can see a visual representation of this lack of transcripts prior to 1955 in the graph displayed on the SCOTUS Search home page.) While Oyez has compiled a truly astounding library of transcripts, there are still many blank cases from 1955 onward that we have therefore been unable to include in SCOTUS Search — as our only sources for transcripts so far are Oyez and the Supreme Court itself. Moreover, as the above link makes clear, the official recordings have endured various hiccups over the subsequent decades that had an impact on transcribers’ ability to ensure perfect quality at times.
    2. For example, in many cases, justices and attorneys are not identified by name in the transcripts and are referred to, instead, as “Unidentified Justice” or “Unknown Speaker.” In other cases, the same speaker is identified differently across cases: “Justice Scalia” and “Justice Antonin Scalia,” for example. Elsewhere, we found examples of misidentification, as when John Roberts was referred to in one transcript as “Chief Justice John Roberts” even though the case was argued prior to his appointment in OT 2005 and Roberts was actually appearing as an attorney arguing before the Supreme Court at the time. Finally, there are also straight-up typos, as pointed out here and here, for example. (Speaking of which…please let us know whenever you find any errors!)
    3. We have attempted to correct as many of these ambiguities and errors as possible. But given the scale of the data, we expect to find hundreds or even thousands of similar examples in various other cases. In the near future, I hope to add an “error correction” form so that registered users can submit changes to transcripts, which we can then review and approve to ensure high accuracy.
  • A lot of you who visited via a link in the Twitter mobile app probably already noticed this, but…SCOTUS Search does not currently play nice with mobile. (Not sure about tablets: neither Victoria nor I own an iPad, so we haven’t tested on one yet.) I absolutely plan to add mobile functionality, but I don’t have a specific ETA just yet.
  • There are a lot of “search type” options (eight, to be precise). All of them are case-insensitive: your capitalization, or lack thereof, doesn’t matter at all. But they are super sensitive to spelling, typos, spaces, and so on. For example, a search for “Superman” will not match “Super man.” This is another weakness I plan on addressing in the future. Anyway, for most people’s purposes, the three most useful search types will be:
    1. Oral argument: Exact phrase. This search type works exactly as advertised: for example, typing “in my underwear” (without quotes!) will bring you to the sole result for a very confusing, and confused, rumination on bullying and the frailty of human memory by Justice Stephen Breyer. As of today (2/18/2015), using quotation marks with this search type will only return results that actually include quotation marks in the transcript text. Assuming that’s not what you’re looking for, don’t use quotation marks when selecting the “Oral argument: Exact phrase” search type.
    2. Oral argument: All search words. This is very similar to the above search type, except the words in the phrase don’t have to be adjacent to each other in the transcript text. If you type, “baseball hockey,” for example, the results will return all statements containing both words, whether or not they were said immediately consecutively.
    3. Oral argument: Any search words. This will return any statement containing any of the words in the search box.
  • Sign up as a user! You don’t have to do it to use SCOTUS Search, but here are some of the benefits:
    1. It’s free.
    2. You get to write notes on individual cases and statements, and to favorite them (for bookmarking purposes). You can even decide whether to make your notes private (viewable only to yourself, which is the default) or public (viewable by any other registered user), and you can look at other users’ public notes as well.
    3. You can export the case titles and metadata of search results (to CSV or XLS format), instead of simply viewing them on the site.
    4. You can save all your searches and set your default search type.
    5. You can receive email alerts any time a case transcript is added or updated (and, as an added bonus, the emails let you know when SCOTUS Map — our sister project — has been updated too).
    6. You get to set your own time zone preferences! Which is, I guess, pretty cool.
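
For the curious, the three search types described above can be approximated in a few lines of Python. This is only a sketch of the matching behavior as this guide describes it, not SCOTUS Search's actual implementation; the function names and the punctuation-stripping rule are assumptions made for illustration.

```python
def exact_phrase(statement: str, query: str) -> bool:
    """True if the query appears verbatim (ignoring case) in the statement."""
    return query.lower() in statement.lower()


def _words(text: str) -> set:
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    # (The stripping rule is an assumption, not documented site behavior.)
    return {w.strip(".,;:!?\"'()") for w in text.lower().split()} - {""}


def all_words(statement: str, query: str) -> bool:
    """True if every query word occurs somewhere in the statement."""
    return _words(query) <= _words(statement)


def any_words(statement: str, query: str) -> bool:
    """True if at least one query word occurs in the statement."""
    return bool(_words(query) & _words(statement))


# A made-up statement, not taken from any transcript.
statement = "They play baseball in summer and hockey in winter."
print(exact_phrase(statement, "baseball hockey"))  # words are not adjacent
print(all_words(statement, "baseball hockey"))     # both words are present
print(any_words(statement, "hockey curling"))      # one of the two is present
```

Because the exact-phrase check is a plain substring test, a query wrapped in quotation marks only matches text that itself contains those quotation marks, which mirrors the behavior noted above.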

Thanks again for checking it out!


Up next: SCOTUS Search


A little over six months ago, I wrote a short blog post called “Introducing SCOTUS Map.” In the time since, the project has really grown up, entirely due to Victoria’s relentless research and updates. SCOTUS Map now displays more than 150 events spanning from last summer to this upcoming one, along with links to registration information, transcripts, audio, and video (where available).

Of late, we’ve added new features as well: there are seven default views to choose from (including “Summer 2014,” “2014 Term,” “Summer 2015,” “Future Events,” and so on), the sidebar can be hidden to enlarge the map, and — as of this week — visitors can now subscribe to daily or weekly email alerts in order to receive updates any time new events are added. (If no new events come through that day or week, don’t worry: we won’t send you an email.)

But believe it or not, SCOTUS Map wasn’t the first Supreme Court project Victoria and I had started. Back in April of last year, three months prior to SCOTUS Map’s launch, we took the first steps towards building the first free, searchable online database of Supreme Court oral argument transcripts.

Currently there are two principal repositories of freely available Supreme Court oral argument transcripts. The first is the recently redesigned Supreme Court web site, which hosts transcripts dating back to the 2000 term. The second, and far more exhaustive, resource is Oyez, which holds oral argument transcripts dating back to the 1950s.

The idea for SCOTUS Search had first come up in this context early last year: Victoria was writing pieces on the Supreme Court for my blog and needed to delve into the oral argument proceedings in order to conduct research. While she could usually locate a specific transcript on either the Supreme Court site or Oyez, each one would have to be searched individually. So if, for example, she was looking for all mentions of “gay marriage” before the Court, she’d have to open every single case that had ever been argued over the past decade or two.

This was clearly an impossible task. Making matters worse, the Supreme Court’s hosted transcripts are stored in PDF format, which — while searchable on an individual basis — are not conducive to automated bulk searching across documents. Oyez boasted a much larger library of transcripts in plain-text, which was far superior from a technical standpoint. However, the site had no full-text transcript search engine, meaning that searching for words or phrases would still require manually opening hundreds or thousands of cases. Additionally, some transcripts were missing and others appeared to cut off partway through.

Starting in 2013, Victoria mentioned to me on numerous occasions her frustrations with the arduous research process. And thus an idea was eventually born last year: if we could somehow consolidate Supreme Court oral argument transcripts across sources and standardize them into a database, we could make the full texts searchable online for free, for the very first time.

Over nine months later, the result of this project is SCOTUS Search. Containing over 1.4 million individual statements spoken in nearly 6,700 Supreme Court oral arguments from the 1950s through the present, the site allows users to search the full text of oral argument transcripts using search options that include filters for speaker and Court term. SCOTUS Search is still in beta, so there are doubtless errors and bugs that we’ll discover over time. In fact, we hope that new visitors to the site will help us out in this regard: if something isn’t working or doesn’t make sense, please let us know so we can fix it.

The recommended way to start is to sign up for a free login. This isn’t required in order to search through transcripts, but there are a lot of features which are only available to registered users: adding notes to cases and individual statements (and sharing them with other users, if you prefer), saving your search history, and marking cases and statements as favorites, for example.

An example search result page.

We’re also planning on adding even more substantial tools for registered users only, including the ability to submit transcript revisions and error/typo fixes where applicable. My long-term wish list includes expanding SCOTUS Search beyond the Supreme Court, to incorporate oral arguments from the federal appeals courts (and perhaps international courts). Imagine being able to trace the thought process and rhetoric of Supreme Court justices back to their days on lower appeals courts, or doing the same with attorneys who have argued before multiple courts. In short, the launch of SCOTUS Search is just the beginning of the road, not the end. There’s plenty more to come.

Finally, it cannot be stated clearly enough what a debt this project owes both to the Supreme Court, for hosting over a decade of transcripts, and especially to Oyez, whose tireless transcription and metadata compilation over the years have proved invaluable to many a researcher and journalist, and whose extensive library of transcripts made SCOTUS Search possible.

So take a look when you get the chance, and let us know what you think! Also, don’t forget to follow us on Twitter.

Thank you!

Today in Data: Basically, I feel like it’s pretty much terrible out there

This morning I was reading New York Times architecture critic Michael Kimmelman’s mostly scathing review of the design of the brand-new 1 World Trade Center, when I came across this passage:

Like the corporate campus and plaza it shares, 1 World Trade speaks volumes about political opportunism, outmoded thinking and upside-down urban priorities. It’s what happens when a commercial developer is pretty much handed the keys to the castle. Tourists will soon flock to the top of the building, and tenants will fill it up. But a skyscraper doesn’t just occupy its own plot of land. Even a tower with an outsize claim on the civic soul needs to be more than tall and shiny.

Emphasis mine. I’ve always hated the term “pretty much,” at least when used in a newspaper article, and I’ve noticed it appearing more and more of late:

So I decided to compare it to a couple other terms whose common denominator is their collective insistence on near-meaninglessness:

Note: The tool I used to obtain these figures, Chronicle by NYTLabs, doesn’t differentiate between words/phrases found within direct quotes and those penned by the reporter him/herself. In the excerpt I quoted above, Kimmelman used the phrase “pretty much” himself, but certainly a portion of the increased usage of these terms in recent years is due to their inclusion in direct quotations, which is (to some extent, anyway) more forgivable from a writing perspective.

Today in Data: A Month in the Life of CitiBike #18068

CitiBike has proven to be quite the hit in New York. As of mid-November, CitiBike claimed that riders had taken 14,589,242 trips since the service launched in May 2013. With approximately 330 docking stations and 6,000 bikes in circulation, that’s a lot of wear and tear on each bike.

In the spirit of finding out just how much of a workout these bikes get, I pulled the latest full month of available bike trip logs from CitiBike’s site, which happens to be August 2014. I sorted by bike ID to determine which specific bike was ridden the most times that month.
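
The counting step is simple enough to sketch in Python. This is a minimal illustration rather than the exact procedure used for this post: it assumes the trip log is a CSV with one row per trip and a bike ID column named `bikeid`, and the function names and the file path in the usage note are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path):
    """Yield one dict per trip from a CSV trip log (path is hypothetical)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def trips_per_bike(rows, id_field="bikeid"):
    """Count how many trip rows mention each bike ID."""
    return Counter(row[id_field] for row in rows)


def busiest_bike(rows, id_field="bikeid"):
    """Return (bike_id, trip_count, mean_trips_per_bike)."""
    counts = trips_per_bike(rows, id_field)
    bike_id, trips = counts.most_common(1)[0]
    mean_trips = sum(counts.values()) / len(counts)
    return bike_id, trips, mean_trips
```

With a downloaded file in hand, `busiest_bike(load_rows("trips.csv"))` would return the most-ridden bike, its trip count, and the average number of trips per bike.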

This led me to CitiBike #18068, a stalwart two-wheeler with 349 individual trips taken in August — over 11 rides per day. (5,958 unique bikes were ridden a total of 963,489 times in August, for an average of 162 trips per bike.) Using the GeoJSON geographic data convention, I was able to map all of these trips by plotting the starting and ending bike stations on Google Maps:
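
For anyone who wants to try something similar, here is one way the trips could be turned into GeoJSON. It is a sketch under assumptions: the post doesn't say which geometry type was used, so this version draws one LineString per trip, and the column names for station coordinates are assumed rather than confirmed. Note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json


def trip_to_feature(trip):
    """Build a GeoJSON Feature linking a trip's start and end stations."""
    # Column names below are assumptions about the trip log's layout.
    start = [float(trip["start station longitude"]),
             float(trip["start station latitude"])]
    end = [float(trip["end station longitude"]),
           float(trip["end station latitude"])]
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [start, end]},
        "properties": {"bikeid": trip["bikeid"]},
    }


def trips_to_geojson(trips):
    """Wrap all trips in a FeatureCollection ready for json.dumps()."""
    return {
        "type": "FeatureCollection",
        "features": [trip_to_feature(t) for t in trips],
    }


# Made-up coordinates for illustration only.
example_trip = {
    "bikeid": "18068",
    "start station latitude": "40.75", "start station longitude": "-73.99",
    "end station latitude": "40.72", "end station longitude": "-74.00",
}
geojson_text = json.dumps(trips_to_geojson([example_trip]))
```

The resulting string can be saved to a file and loaded by mapping tools that accept GeoJSON.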

Today in Data: NYC Restaurant Inspections

New York City, like an increasing number of American metropolises, boasts a decently impressive Open Data web site. A lot of the tables are out of date or otherwise useless, but some of them are pretty cool and are updated fairly often.

Tonight, as a hobby while allowing my digestive system to process copious amounts of turkey, stuffing, and chocolate cheesecake, I downloaded the restaurant inspection data table, which contains over half a million records of inspections within the city over the past several years.

I began by whittling down the dataset to include only inspections that took place this year. Then I removed all inspections that didn’t have a borough field filled out (Bronx, Manhattan, Queens, Brooklyn, or Staten Island), as well as removing all rows with anything other than A, B, or C in the field for letter grade. Finally, I filtered out all but the most recent inspection for each establishment. So if a particular diner, for example, was inspected multiple times due to its uncleanliness (this is official policy), I only included the last one.

This left a final count of 22,105 restaurant inspections in 2014 alone — only the last one conducted for each establishment, and only for inspections resulting in a letter grade of A, B, or C and associated with one of the five boroughs.
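
The filtering steps above could be expressed roughly like this in Python. Treat it as a sketch: the field names (`CAMIS` for the establishment ID, `BORO`, `GRADE`, `INSPECTION DATE`) and the MM/DD/YYYY date format are assumptions about the downloaded table, and the code expects every row to carry a well-formed date.

```python
from datetime import datetime

BOROUGHS = {"BRONX", "MANHATTAN", "QUEENS", "BROOKLYN", "STATEN ISLAND"}
GRADES = {"A", "B", "C"}


def latest_graded_inspections(rows, year=2014):
    """Keep the most recent A/B/C inspection per establishment for one year."""
    latest = {}  # establishment ID -> (inspection date, row)
    for row in rows:
        date = datetime.strptime(row["INSPECTION DATE"], "%m/%d/%Y")
        if date.year != year:
            continue  # wrong year
        if row["BORO"].strip().upper() not in BOROUGHS:
            continue  # borough missing or unrecognized
        if row["GRADE"].strip().upper() not in GRADES:
            continue  # no letter grade of A, B, or C
        key = row["CAMIS"]
        if key not in latest or date > latest[key][0]:
            latest[key] = (date, row)
    return [row for _, row in latest.values()]
```

Filtering before picking the latest row means an establishment whose newest inspection has no letter grade is still represented by its most recent graded one.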

First, I checked to see whether any discrepancies existed among the letter grades awarded to restaurants in the various boroughs:

Here’s the same chart in percentage format:

Interestingly, where I began to see a divergence was when I checked grades by month, rather than by borough:

In the winter months (January through March), as well as so far this November, A grades constituted over 90% of all final inspections. From April to October, however, that ratio hovered anywhere from just under 83% to just under 90%. Of course, I don’t have the December numbers for this year yet (time travel has yet to be invented — unless, of course, it’s already happened in the future), but I’d assume it would follow the same general trend: fewer A grades in the summer, more in the winter.
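
The month-by-month comparison boils down to computing the share of A grades among each month's inspections. A minimal sketch, again assuming fields named `INSPECTION DATE` (MM/DD/YYYY) and `GRADE` and a hypothetical function name:

```python
from collections import Counter, defaultdict
from datetime import datetime


def a_share_by_month(rows):
    """Return {month number: percentage of inspections graded A}."""
    by_month = defaultdict(Counter)
    for row in rows:
        month = datetime.strptime(row["INSPECTION DATE"], "%m/%d/%Y").month
        by_month[month][row["GRADE"]] += 1
    return {
        month: 100 * grades["A"] / sum(grades.values())
        for month, grades in sorted(by_month.items())
    }
```

Feeding it the filtered 2014 rows would yield one percentage per month, which is the kind of figure compared in the paragraph above.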

To delve further into this hypothesis, I filtered out all A grades and sorted the remaining 2,648 Bs and Cs by their most common violation descriptions. Here are the top 10:

The top violation is storing food at temperatures that are too high, something that would occur most frequently in the summer months. And indeed, 272 of the 406 total counts of this violation (67%) took place in the four-month period from June to September 2014, for a monthly average of 68 counts. By contrast, from January through May, restaurants were only cited for this violation on a total of 94 occasions, or fewer than 19 times per month.
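
The tally and the arithmetic behind these figures can be reproduced with a short Python sketch. The counts (272, 406, 94) come straight from the paragraph above; the field name `VIOLATION DESCRIPTION` and the helper names are assumptions for illustration.

```python
from collections import Counter


def top_violations(rows, n=10, field="VIOLATION DESCRIPTION"):
    """Most common violation descriptions among B- and C-graded rows."""
    counts = Counter(row[field] for row in rows if row["GRADE"] in {"B", "C"})
    return counts.most_common(n)


def monthly_average(total_counts, months):
    """Average number of citations per month over a period."""
    return total_counts / months


summer_share = 272 / 406               # June through September vs. all counts
summer_avg = monthly_average(272, 4)   # four summer months
jan_may_avg = monthly_average(94, 5)   # January through May

print(f"{summer_share:.0%}")  # 67%
print(summer_avg)             # 68.0
print(jan_may_avg)            # 18.8
```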

Indeed, of the five top violations reported in inspections resulting in B and C grades, four involve either overly-hot food or (the potential for) infestation by rodents, flies, and so forth — in other words, classic summer problems.

One final thing you might not expect: the cleanest bill of health given to a specific type of cuisine was for…donut shops. So you’ll know where to find me in the next few weeks:

That’s it for now. Feel free to send me more ideas for how to parse this data, and I may continue this series with other datasets as well.