Main Page

Project Introduction

This is an experimental Wikibase instance where I am exploring a new method for handling the data that the Environmental Data and Governance Initiative's (EDGI) Environmental Enforcement Watch (EEW) project works to make sense of. Most of those data come from the U.S. Environmental Protection Agency's (EPA) Enforcement and Compliance History Online (ECHO) system.

EEW works to make the best sense of the data the EPA puts online, developing regular analytical reports that translate what ECHO reports about facilities' legal environmental compliance into an accessible form. The project also develops code notebooks that help other groups, such as investigative journalists, understand and work with these data.

I've been interested in how we might regularize and institutionalize this same concept for all kinds of government data. Especially now, under various open government and open data policies, we have massive amounts of potentially useful government data flowing online. Just because those data are "open" doesn't mean they are accessible and usable. We often have all kinds of obscure codes and disconnected bits of data that make perfect (maybe) sense in the context of producing and working with those data within their source agencies but make no sense to anyone else without a whole lot of deciphering.

I'm exploring what that deciphering might look like if we take more of a knowledge organization approach to the whole problem rather than a data integration approach. What if we try to fit as much of that massive amount of public data as possible into the global knowledge commons being pursued by the Wikimedia Foundation and its raft of contributors, especially through Wikidata? I'm particularly excited about the latest developments here with Wikibase instances in the cloud that are essentially being designed to be adjacent knowledge graphs focused on a particular domain and context.

Having experimented for the last couple of years with Wikidata itself, I can say with certainty that it's really quite hard to work legitimately within that whole system. It's a nontrivial process to be careful about semantic and structural alignment of concepts and not just throw things into a giant database. You have to figure out what other people mean by the properties available for use and by the items used to classify other items. If that's not clear from their definitions, you have to dig into the history behind them to understand whether they align with your own intent. If not, the responsible thing to do is to jump into the conversation and try to influence things in a direction that results from community consensus. That takes a lot of time and energy!

As an alternative, or more of a stepping stone, we can start with a clean instance of the same knowledge organization technology and work things out within a specific context. The responsible way to go about that still involves examining both the very messy global knowledge commons (Wikidata and related things) as well as other ontologies and sources of explicit semantics. As we build our own thing, we should establish linkages to other things, complete with notes on what those relationships mean and how they might be exploited in the future. I tend to think we'll end up with a global knowledge commons that is more about pop-up knowledge graph indexes, perhaps using developing tech like Weaviate and others, that reach out and exploit these relationships to develop efficient point-in-time renderings optimized for use.
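To make "exploiting these relationships" a little more concrete, here is a minimal sketch of a federated SPARQL query run from Python with SPARQLWrapper. Everything specific in it is an assumption for illustration, not an actual value from this instance: the endpoint URL, the prefix, and a hypothetical local property P2 holding "exact match" links to Wikidata entity URIs. It also assumes the query service allows federation out to Wikidata, as Wikibase Cloud instances generally do.

```python
# A sketch of following cross-links from a local Wikibase out to Wikidata.
# Hypothetical values: the endpoint URL, the lwdt: prefix, and property P2.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.wikibase.cloud/query/sparql"  # placeholder endpoint

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX lwdt: <https://example.wikibase.cloud/prop/direct/>

SELECT ?item ?wd ?wdLabel WHERE {
  # Local items that declare an "exact match" link (hypothetical P2) to Wikidata
  ?item lwdt:P2 ?wd .
  # Federate out to Wikidata to resolve English labels for the linked entities
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd rdfs:label ?wdLabel .
    FILTER(LANG(?wdLabel) = "en")
  }
}
LIMIT 25
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["item"]["value"], "->", row["wdLabel"]["value"])
```

A "pop-up" index in the sense above would run queries like this on demand, caching the point-in-time results rather than permanently merging the two graphs.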

Part of the reason I'm interested in this dynamic for the EEW project is that, like a lot of groups, they really don't have the capacity to engineer and operate a bunch of big data tech. What if, instead, there were some tech that a small non-profit like this could push things to, focusing more on the code to do the work (which they have to write anyway) and less on the foundational infrastructure for where the data goes? If that data infrastructure is also fully in the public domain and part of a well-established global organization dedicated to building the global knowledge commons, then we have a pretty good chance of developing something truly lasting.

Software Code

Ultimately, I want to build everything in a Wikibase instance like this from code. Most items here are built with Python code leveraging the WikibaseIntegrator project. The code project for this is on GitHub. I mostly work in Python notebooks: they give me a chance to write notes as I figure things out, they make it easy to share my thinking with other people, and they can be executed all over the place these days. Eventually, I'll work some of that code into more formal, deployable packaging.
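For a flavor of what that code looks like, here is a minimal sketch of creating an item with WikibaseIntegrator's current (0.12-style) API. The URLs, credentials, and property P1 (imagined here as an ECHO facility identifier) are placeholders I've made up for illustration, not actual values from this project.

```python
# A minimal sketch of writing a Wikibase item with WikibaseIntegrator.
# The URLs, credentials, and property P1 are placeholders, not real values.
from wikibaseintegrator import WikibaseIntegrator, wbi_login, datatypes
from wikibaseintegrator.wbi_config import config as wbi_config

# Point the library at the target Wikibase instance (hypothetical URLs)
wbi_config['MEDIAWIKI_API_URL'] = 'https://example.wikibase.cloud/w/api.php'
wbi_config['SPARQL_ENDPOINT_URL'] = 'https://example.wikibase.cloud/query/sparql'
wbi_config['WIKIBASE_URL'] = 'https://example.wikibase.cloud'
wbi_config['USER_AGENT'] = 'eew-experiment/0.1 (example contact)'

# Authenticate with a bot password created on the wiki
login = wbi_login.Login(user='ExampleBot', password='bot-password-here')
wbi = WikibaseIntegrator(login=login)

# Build a new item: label, description, and one claim
item = wbi.item.new()
item.labels.set('en', 'Example facility')
item.descriptions.set('en', 'A facility regulated under an EPA program')
# Hypothetical property P1: an external identifier drawn from ECHO
item.claims.add(datatypes.ExternalID(value='110000000001', prop_nr='P1'))

# Write the item to the Wikibase and report its new QID
new_item = item.write()
print('Created', new_item.id)
```

In a notebook, a loop over records pulled from ECHO would build items like this one in bulk, which is why keeping the data infrastructure out of the project's hands and the logic in shareable code is attractive.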

Acknowledgements

I am also taking advantage of the Earth Science Information Partners (ESIP) in this work, most notably the Pangeo instance the ESIP Lab operates, which I use for building and running the code that builds the Wikibase. ESIP is where I first heard about and got interested in the EDGI-EEW project. I fiddled around with the PAWS platform the Wikidata folks operate (another JupyterLab instance), but it's not as functional for my purposes.

Disclaimer/Disclosure

I'm a sometime volunteer with the EDGI-EEW project, but I spun this Wikibase instance up on my own initiative and time. If it proves interesting to carry forward in some more official capacity with the organization/project, we'll recast it at that time.

I also work for a U.S. Federal Government science agency. I only work part time in that capacity now, and I'm doing this experimental work on my own personal time. Some of the concepts I am pursuing and developing here overlap with research and development I am doing in my "day job," so I will occasionally take lessons learned in this context and apply them in the other. In any case, anything I do or contribute is dedicated to the public domain, where I hope others can leverage it. I do try to carefully distinguish between the two hats I wear and to work in compliance with my agency's rules for both paid and volunteer work.