Connecticut COVID-19 Cases

I’ll make this one quick.

I wrote a Perl script to extract Connecticut's Coronavirus data, which unfortunately is published as PDFs!?!? The script downloads the Coronavirus page to find the latest test data PDF (example). It does a little magic with poppler’s pdftotext and prints out CSV with Connecticut’s towns and the number of cases in each. Click here for a map of the data as well as download links.
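Most of that "little magic" is two ordinary text filters applied to pdftotext's output: `grep -v Data` drops the "Data are preliminary" notices, and `cat -s` squeezes the blank lines left behind. Here is a minimal sketch of just that step. The `sample` and `clean` functions and the sample lines are made up for illustration and are not real pdftotext output.

```shell
#!/bin/sh
# Hypothetical stand-in for a few lines of pdftotext output.
sample() {
    printf '%s\n' 'Town' '' 'Data are preliminary' '' 'Bethel' 'New' 'Haven'
}

# The same two filters the script pipes pdftotext through.
clean() {
    grep -v Data | cat -s
}

sample | clean
```

The notice line disappears and the two blank lines around it collapse into one, which matters because the script treats a single blank line as the boundary between a column of towns and a column of case counts.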

The Script

Download it here, do a quick chmod +x, and run it. The script shells out to curl and pdftotext. If you use Debian, you should install curl, poppler-utils, and perl [1]. Mac users should install poppler and perl from Homebrew.

    #!/usr/bin/env perl
    # BSD 2 Clause License: https://opensource.org/licenses/BSD-2-Clause

    use v5.28;
    use strict;
    use Scalar::Util qw(looks_like_number);

    # the state uses bad crypto :(
    # https://www.ssllabs.com/ssltest/analyze.html?d=portal.ct.gov
    # https://weakdh.org/
    my $curlcipher = "ECDHE-RSA-AES256-SHA384";

    # Allow the url of the pdf to be *optionally* passed in the shell
    # Use this to get historical data
    my $pdfurl = $ARGV[0];
    if (not defined $pdfurl) {
        my $coronapage = `curl -sS --cipher $curlcipher https://portal.ct.gov/Coronavirus`;
        if ($coronapage =~ m[(/-/media/Coronavirus/.*summary.*\.pdf)]) {
            $pdfurl = "https://portal.ct.gov$1";
        }
    }
    # Bail out early instead of handing curl an empty url
    die "Couldn't find the summary pdf on the Coronavirus page\n" unless defined $pdfurl;

    # Remove all the "Data are preliminary" with `grep -v Data`
    # Remove the left-over double blank lines with `cat -s`
    my @coronatext = `curl -sS --cipher $curlcipher $pdfurl | pdftotext - - | grep -v Data | cat -s`;

    # Towns with two words have a chance of being split onto two lines,
    # this is used to fix that
    my %prefixes = map { $_ => 1 } ("New", "Old", "North", "South", "East", "West");

    my @towns;
    my @cases;
    my $context = '';
    my $prepend = '';

    # Small state machine: skip to the towns page ('align'), then alternate
    # between columns of town names ('towns') and columns of counts ('cases')
    foreach (@coronatext) {
        chomp;
        next unless (/\f.*\. Towns/ or $context);
        if ($context eq '') {
            $context = 'align';
        } elsif ($context eq 'align') {
            next unless /Town/;
            $context = 'towns';
        } elsif ($context eq 'towns') {
            next if /Town/;
            last if (/Last/ or /\f/ or looks_like_number($_));
            if ($_ eq '') {
                $context = 'cases';
                next;
            }
            if (exists($prefixes{$_})) {
                $prepend = "$_ ";
            } else {
                push @towns, $prepend . $_;
                $prepend = '';
            }
        } elsif ($context eq 'cases') {
            next if /Cases/;
            if ($_ eq '') {
                $context = 'towns';
                next;
            }
            push @cases, $_;
        }
    }

    say "Town,Cases";
    for my $i (0 .. $#towns) {
        say "$towns[$i],$cases[$i]";
    }
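The `%prefixes` lookup is the least obvious part of the script, so here is the same idea on its own. This is a rough awk translation for illustration, not the code the script runs. The `join_towns` function name and the sample town list are made up.

```shell
#!/bin/sh
# Re-join two-word town names that were split across lines.
# A line that is only a known prefix is held back and glued
# onto the front of the next line.
join_towns() {
    awk '
        BEGIN {
            n = split("New Old North South East West", p, " ")
            for (i = 1; i <= n; i++) prefix[p[i]] = 1
        }
        ($0 in prefix) { prepend = $0 " "; next }
        { print prepend $0; prepend = "" }
    '
}

printf '%s\n' Bethel New Haven Old Lyme Avon | join_towns
```

Fed those six lines, it prints Bethel, New Haven, Old Lyme, and Avon, one per line. The trick only works because no town is named just "New" or "East" on its own, which is the same assumption the Perl version makes.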

Historical Data

To get historical (March) data, I wrote a small shell script that runs the script above (saved as fetch.pl) on all the previously published PDFs:

    #!/bin/sh
    # BSD 2 Clause License: https://opensource.org/licenses/BSD-2-Clause

    # Grabs all Connecticut Coronavirus data from March
    # The % in the URL is on purpose: xargs swaps in the day of the month
    URL="https://portal.ct.gov/-/media/Coronavirus/CTDPHCOVID19summary3%2020.pdf"

    # Data starts on the 21st
    seq 21 31 | xargs -I% sh -c "./fetch.pl $URL > corona-2020-03-%.csv"
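If the `%` in the URL looks like a typo, a dry run makes it clearer. The sketch below swaps the fetch for an `echo`, so nothing is downloaded. The `preview` function is made up for illustration, and it assumes your xargs replaces every `%` in the argument, as GNU xargs does.

```shell
#!/bin/sh
# Same URL template as the script above.
URL="https://portal.ct.gov/-/media/Coronavirus/CTDPHCOVID19summary3%2020.pdf"

# Print which pdf would land in which csv for days $1 through $2.
preview() {
    seq "$1" "$2" | xargs -I% echo "$URL -> corona-2020-03-%.csv"
}

preview 21 23
```

For the 21st, this prints a URL ending in CTDPHCOVID19summary3212020.pdf next to corona-2020-03-21.csv, so one placeholder fills in both the date in the pdf name and the date in the output file.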

Footnotes

[1]: If you don’t already have it :)