A Tool For Data Mining Layers
Data mining tools often require users to understand arcane query syntax and to learn complex user interfaces in order to compose data mining queries. What’s needed is data mining that gives non-programmers the power to intuitively construct data mining queries through direct manipulation (just pointing and clicking and without knowledge of underlying query languages) and to visualize data mining results in a form that encourages interactive exploration and discovery.
With these goals in mind, I recently created a data mining tool I call a “stratigraph”.
A stratigraph lets non-programmers explore data by visually assembling queries through direct manipulation and rendering query results in layers. Each layer — one for each predicate of a query — can be subsequently filtered by selecting a value from the current layer. Selecting a value in a layer expands (zooms) the layer to produce a higher layer based on the selected value, all previous layer predicates, and the next user-specified predicate. As exploration continues upward, layers build on top of each other, forming query “spires” that become visual query landmarks. This interactive exploration process is evocative of how geologists decode the story of rock formations by analyzing how strata (layers) evolve. The name “stratigraph” is an homage to this science of stratigraphy.
The stratigraph creates and visualizes data queries (currently SQL-based) made up of multiple predicates conjunctively (AND’d) joined together. As shown above in FIG. 1 for a given query, the stratigraph calculates groups with row counts for each predicate and renders each group as a colored “block” 10 within its predicate layer. Groups with larger row counts are rendered as blocks with larger widths within the block’s query predicate layer. The root (bottom) layer 20 of the stratigraph shows the unfiltered count of the primary query key of interest. Results of a first query predicate render as the first layer 30 above the root layer. In turn, subsequent predicates and their grouped results (which are filtered by previous predicates) render on top of their preceding (ancestor) layers.
The stratigraph in FIG. 1 is built from queries composed of predicates AND’d together — shown as lightly colored gray labels in the background 40 — these predicates match database columns for:
“Category”, “Status”, “Mass Killing”, “Assault Weapon”, “Year”, “State”, “City”, “Age”, “Gender”, “Incident Type”
The root layer of this stratigraph is labeled “All People”. This root layer shows the database contains a total of 14314 “People”. Moving up layer by layer, of these “People”, 12195 are “Victims” and 2119 are “Perpetrators”. Of the “Victims”, there are 9647 “Injured” and 2477 “Killed”. The third layer 50 shows grouped row counts for the “Status” predicate (column). Specifically, the query for this “Status” layer is:
/* Shows counts for “Status” values between ‘1/1/13 and 12/31/19’ for Victims */
SELECT P.`Status`, COUNT(*) FROM Incidents I JOIN Participants P ON I.`Incident ID` = P.`Incident ID` WHERE (I.`date` between ‘2013–01–01’ and ‘2019–12–31’) AND (P.Type = ‘Victim’) GROUP BY P.`Status`
This “Status” layer query produces several blocks — one for each group by result — “Injured”, “Killed”, “Unharmed”, etc. These blocks are arranged horizontally to make up the “Status” layer. Block widths are based on group by counts.
Selecting a block in a layer expands the layer and produces a descendent layer above it. This progressive expansion of query layer predicates forms query spires. Multiple query spires can originate from separate values within a layer.
As shown in FIG. 1, the top predicate of the spire is “City” and the tooltip shows “El Paso ” 60 for the selected block within the query spire. The label at the bottom of the image 70 shows the spire is built up from layer blocks from bottom to top:
All People/Victim/Killed/Mass Killing/With An Assault Weapon/2019/Texas/El Paso 
This query spire translates as:
In 2019, in El Paso, Texas, 22 victims were killed in a mass killing (4+ people total killed at the incident) with an assault weapon
The stratigraph enables non-programmers to easily construct arbitrary database queries via direct manipulation. Queries on the underlying database are triggered by pointing and clicking block values. No typing of input parameters is needed. No understanding of underlying query syntax is needed.
FIG. 2 below shows another query spire 200 where “Perpetrators” was selected as the first predicate above “All People” and then block values were selected for higher layers:
In plain words the query spire shows:
In 2016, in Pittsburgh, Pennsylvania, 2 adult, male perpetrators were unharmed and arrested for a mass killing (4+ people total killed at the incident) with an assault weapon, and the perpetrators were in possession of a gun as a felon or prohibited person
To encourage exploration of the database and to hone in on specific queries of interest, layers of the stratigraph can be dragged and dropped to change the total ordering of query predicates. As shown in FIG. 3, the “Mass Killing” layer is dragged from the fifth layer of the stratigraph to the second layer beneath “Incident Status” 300. This drag and drop reorders the predicates of the query:
In addition, for specific layers where multi-predicate value selection is desirable, multiple blocks in these layers can be AND’d (or OR’d) together by holding down the SHIFT (or META) key while clicking on each block. As shown in FIG. 4, multiple blocks 400 and 410 in the “Incident Type” layer can be AND’d together:
Blocks within layers often correspond to a group with a single row count. Such blocks are one pixel wide and so cannot be readily moused over and selected. To address this user interface challenge, I initially implemented a fisheye lens scheme that dynamically inflated the width of narrow blocks as the pointer moved over the block. This fisheye lens approach was cool but caused blocks to dynamic grow vertically above (and below) the layer. This “bulging” block animation introduced, in my opinion, a “roundness” that looked out-of-place when juxtaposed on the rectilinear layers:
So instead of using a fisheye lens approach, I created an “accordion” scheme whereby blocks dynamically inflate — but keep their rectilinear shape — as a pointer moves over narrow blocks. As shown in FIG. 5, This scheme enables blocks 500 to be readily moused over and labeled with a tooltip regardless of their original width. Furthermore, this accordion scheme preserves the relative widths of adjacent blocks (enabling comparison) unlike the fisheye lens which (deceptively) makes narrower blocks temporarily wider than adjacent blocks:
I originally started out building the stratigraph using D3.js nested treemaps — one treemap for each predicate result — but then decided to prioritize on visualizing how query results become filtered and narrowed as more predicates are AND’ed together. This change of focus led me to change my implementation from nested treemaps to a dynamic icicle:
I inverted the icicle to grow from the bottom upward to match how geological structures build up over time according to the Law of Superposition — that is, the lowest layers are the oldest, and higher layers are the youngest. I then happily discovered that an inverted icicle is well-known as a “flame graph”. The flame graph — inverted icicle — is a powerful data visualization tool that helps find bottlenecks in CPU traces. I ultimately based the stratigraph on Brendan Gregg’s open-source flame graph. Unlike flame graphs — which utilize profiling data from a computer program — the stratigraph had to be adapted for database mining along with the interactive features described above.
My participation in the non-profit project where I created the stratigraph ended before I could refactor, document, and tune the stratigraph code. I hope to polish the code for public release as well as improve the stratigraph with additional functionality:
Bootstrapping — Probing database DDL for schemas and columns in order to semi-automatically generate predicates. Right now the predicate to column mappings are hardcoded. It would be desirable to semi-automate this mapping so the stratigraph worked out-of-the-box on any database.
Query semantics — Enabling range selections within a layer. I have a user interface modality that I think will enable users to specify predicate value ranges in a seamless, direct manipulation fashion. Having range selections would improve the richness of queries.
Navigation — Adding history and bookmarks to capture generated spires. Right now, spires disappear when the stratigraph is refreshed. It would be desirable to capture the query spire history of an entire session.
Animation — Showing how the stratigraph evolves over time-based predicates.
Automatic Information Retrieval — Applying Machine Learning to divine the most interesting and relevant predicates and predicate layer orderings.
Replication — Copy query spires horizontally across other blocks and layers like how formulas are readily duplicated across a spreadsheet.
Rebuilding Spires — After dragging and dropping layers, it would be convenient to replay previous block values that made up query spires prior to reordering the layers.
Let me know if you have any ideas.