Explore redaction X-rays, decentralized storage, archive n-grams and more new Add-Ons

Join us May 27 for a virtual Add-On hackathon to help analyze, scrape and visualize even more

Today we announced support from the Filecoin Foundation for the Decentralized Web to build approaches to obtaining and preserving the most important documents for understanding our world.

We’re also gearing up to provide a range of organizations with $400,000 in support over the next two years, along with direct technical assistance for their projects, to help make that vision a reality. You can already get an early look at what’s to come with today’s launch of a range of new Add-Ons that make DocumentCloud more useful while providing a template for building out your own new features and functionality.

We’re also going to be hosting a virtual hackathon to explore ways to push forward key transparency tools and approaches.

If you’re interested in better ways to extract data from PDFs, integrating powerful libraries into DocumentCloud or helping build a flexible scraping system that can automatically ingest and analyze documents as they are posted to agency websites, please register to join us. You are welcome to join for part or all of the day, and we’ll use a mix of chat and video to work through some exciting projects.

Update: Due to timing and promotion issues, we’ve decided to hold off on the day of Add-On work and instead hold a smaller user feedback and discussion session. We’d love for you to join us May 26 at 2 p.m. Eastern for an hour if you’re available. Register here instead.

Key to our vision is making it easier for more people to do more with primary source materials, whether that’s offering new ways of analyzing them, offering AI and machine learning tools, or providing simplified approaches to monitoring and scraping government websites.

New PDF superpowers, right within DocumentCloud

We’re excited to share some of the early results of these efforts, thanks to a partnership with BU Spark! that brought together a team of students to help us pilot our new technologies with some wonderful results. We’ve also made it possible for any verified MuckRock user to link their GitHub account, so you can import your own Add-Ons and explore the platform yourself. Read how to import your own Add-Ons here.

Some of the new features imported into DocumentCloud, along with source code you can examine or even fork to build your own Add-Ons:

  • Push to IPFS/Filecoin: Push the selected documents to the decentralized web, making them accessible via IPFS and Filecoin via Estuary (View source)
  • Bad Redactions: Building off the excellent X-Ray library from Free Law Project, Bad Redactions looks for failed redactions that leave the underlying data intact. This is useful both for investigating whether there’s more information than meets the eye and for making sure you properly and fully delete information from your own uploads. Note that DocumentCloud automatically flattens pages and deletes underlying data when you use our redaction tools or force OCR. We recommend trying it on the infamous Manafort filing, where the Add-On flagged and highlighted 25 redaction errors during our test. You can have the Add-On leave a private annotation around the mis-redacted information or have it go ahead and properly redact it for you. (View source)
  • Metadata export: We have two new Add-Ons that export metadata from selected documents, making it easier for you to take your key-value tags, page count and much more into your favorite spreadsheet program for further analysis. (View source)
  • N-Gram Graphs: Feel like you’re seeing a term pop up more and more often? Now it’s easier to validate your hunch: this Add-On charts how often the words you input occur over time and compares them to each other across a given search. (View source)
  • Page Stats: Gives you basic statistics about the total length of a selection of documents, the longest document, shortest document and average pages per document. (View source)
  • User upload frequency graph: Curious whether you’re more productive during some months than others? Want to see the progress of your sharing with the public? Use this Add-On to graph your uploads over time. Tip: Put your username in as it appears in the search field (e.g., michael-morisy-658). (View source)
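To give a feel for how small the logic behind an Add-On like Page Stats can be, here is a minimal Python sketch of the statistics described above. It is an illustration only, not the Add-On’s actual source: the Doc class and the page_stats function are hypothetical stand-ins, and a real Add-On would read page counts from the documents you select rather than from a hand-built list.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Doc:
    # Hypothetical stand-in for a document record; not DocumentCloud's API.
    title: str
    page_count: int


def page_stats(docs):
    """Return total, longest, shortest and average page counts for docs."""
    if not docs:
        raise ValueError("need at least one document")
    # max/min return the first document found when page counts tie.
    longest = max(docs, key=lambda d: d.page_count)
    shortest = min(docs, key=lambda d: d.page_count)
    return {
        "total_pages": sum(d.page_count for d in docs),
        "longest": longest.title,
        "shortest": shortest.title,
        "average_pages": mean(d.page_count for d in docs),
    }
```

Calling page_stats on a list of Doc records returns a dictionary with the total page count, the titles of the longest and shortest documents, and the average pages per document. The linked source for each Add-On shows how the real implementations are put together.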

Feeling inspired? Join us May 27 for our virtual, distributed Add-On-a-Thon, where we’ll work to help build out other Add-Ons, answer questions, and also discuss other ways to build stronger transparency efforts at scale, from scraping and monitoring government websites to integrating with third-party libraries. Whether you want to spend the day working on a project, just a few minutes or just want updates, we’ll keep you in the loop and share more details as we get closer to the event.

We’ll be announcing more Add-Ons and how to apply for support for your ideas and work in the coming weeks, and encourage anyone who is interested in applying or learning more to register for the DocumentCloud newsletter.

For those who are eager to get started, however, we have another exciting opportunity: We’re hiring for a range of operational, editorial and technology roles, starting today.

We’re currently looking for a front-end developer to help us continue to build out this exciting platform and a data journalist to join our growing editorial team and put these tools to work. If nothing’s a great fit right now, we’d still love to learn more about you, and we’ll get in touch if a suitable role opens up.