Making ‘Making Legible’ Legible: Part 5

Building on my previous work analyzing a large corpus of text, I continued to explore how the connections across various documents could be presented. My prior work on the project focused on constructing the database to allow for as much cross-analysis as I could (at that time) imagine, and on building out routes in Express.js and Node for accessing the data. With the eventual goal of uncovering a genealogy across the texts, I've been looking at both document-level and sentence-level comparisons.

The focus of this iteration centered on two questions: how does the user move across these scales, and what information is relevant at each scale?

The initial landing page is imagined as a genealogy of documents. Currently, this is shown simply as the established folder-document hierarchies, but I intend to evolve it into a content-based "family tree". This would also incorporate time on the Y axis. (The author often brought in material between documents rather than working out of a single document chain.)
When hovering over a document, its similarity to other documents is shown by size and color. This offers additional information for identifying which document(s) to investigate further.
Clicking on a document in the genealogy then compares that document to all other documents at the sentence level. Each compared document is represented by a pixel array in which each sentence pair across the two documents is compared using the Dice coefficient. This similarity value is again mapped to size and color. When dominant diagonals are evident, they indicate a high level of similarity within a portion of the two documents. The jump from the genealogical view to the comparison view still feels very disconnected and needs a lot more work.
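Since the Dice coefficient drives the pixel-array comparison, here is a minimal sketch of how a sentence-pair score might be computed, assuming sentences are tokenized into lowercase word sets (the actual implementation may instead operate on character bigrams):

```js
// A minimal sketch of a Dice-coefficient comparison between two sentences.
// Assumption: tokens are lowercase words; the project's real tokenization
// may differ.
function tokenize(sentence) {
  // Split on non-word characters and drop empty tokens.
  return new Set(sentence.toLowerCase().split(/\W+/).filter(Boolean));
}

function diceCoefficient(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size + setB.size === 0) return 0;
  let overlap = 0;
  for (const token of setA) {
    if (setB.has(token)) overlap += 1;
  }
  // Dice: 2|A ∩ B| / (|A| + |B|), from 0 (disjoint) to 1 (identical).
  return (2 * overlap) / (setA.size + setB.size);
}

// Example: a near-duplicate pair scores close to 1.
console.log(diceCoefficient(
  'The structure of data has consequences.',
  'The structure of data has profound consequences.'
)); // ≈ 0.92
```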

From the array of arrays, a document-pair comparison can be isolated, and the user can now finally read the constituent sentences when hovering. This raises the question of what use the investigation is if the readable sentences are buried so deep in the interaction/piece. On one hand, with 680 documents, it's impossible to get a sense of the 'whole picture' without some form of abstraction. But how can the abstraction still be relevant? For me, within this project, the abstraction is about constructing and revealing relationships across the corpus, in a (not-yet-realized) attempt to get beyond the established document and sentence boundaries.

The visuals above show my ambition for the project, while the video below shows its current (rudimentary) coded form.


Making ‘Making Legible’ Legible: Part 4

As I've discussed in previous posts, this project attempts to find the dominant tendencies and abandoned nodes within a large body of text. The collection of text evolved in structure over time, so examining it from a purely document-based approach is not appropriate.

If the existing boundaries of a body of text are conventionally documents, this project instead treats sentences as the primary object and attempts to draw new boundaries around sentences across various documents.

Much of my effort for this project focused on how to create the structure for these relationships to come about.

Dismantling the existing boundaries and drawing new ones is fundamentally a question of how the text is organized in the database: what properties do objects need to have, and how can those properties be used?

Before processing the body of text, I crafted a spreadsheet to track which properties were inherited or unique, and which were content- or context-related. For example, context included the IDs of adjacent sentences, whether the sentence was part of a duplicate document, and its relative position within a document.
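To illustrate, a sentence object might carry properties along these lines; the field names here are hypothetical, not the project's actual schema:

```js
// A hypothetical sentence object reflecting the inherited/unique and
// content/context distinctions tracked in the spreadsheet. Field names
// are illustrative only.
const exampleSentence = {
  // unique, content-related
  _id: 'sent-0042',
  content: 'The structure of data has profound consequences.',

  // inherited from the parent document
  documentId: 'doc-0007',
  timestamp: '2015-04-12T00:00:00Z',

  // context-related
  previousSentenceId: 'sent-0041',
  nextSentenceId: 'sent-0043',
  relativePosition: 0.37,      // position within the document, 0 to 1
  inDuplicateDocument: false   // whether the parent document is a duplicate
};
```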

Once all the text had been atomized into the database collections, my focus shifted to comparing the similarity of sentences. Similarity between sentences across time is the building block for identifying the dominant tendencies within the text.

After running into memory and time problems while attempting to compare every sentence to every other sentence, I moved to a comparison method in line with my ambitions for representation. Sentences were grouped into rows, with each row representing a single date. The sentences in one row are compared only to those in the following four rows. This limits the number of comparisons while still recognizing that an appropriate match may not be in the immediately adjacent time period.
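A minimal sketch of this windowed comparison, assuming `rows` is an array of `{ date, sentences }` groups sorted by date and `similarity` is a pairwise scoring function such as the Dice coefficient (all names here are illustrative, not the project's actual code):

```js
const WINDOW = 4; // each row is compared only to the next four rows

function compareRows(rows, similarity) {
  const comparisons = [];
  for (let i = 0; i < rows.length; i++) {
    // Only look ahead within the window, never at the whole corpus.
    const limit = Math.min(rows.length, i + 1 + WINDOW);
    for (let j = i + 1; j < limit; j++) {
      for (const a of rows[i].sentences) {
        for (const b of rows[j].sentences) {
          comparisons.push({
            sentenceA: a._id,
            sentenceB: b._id,
            score: similarity(a.content, b.content)
          });
        }
      }
    }
  }
  return comparisons; // written to a separate collection, as noted below
}
```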

This comparison data was stored in a separate collection from the sentence objects themselves.

Difficulty with string comparisons…


Making ‘Making Legible’ Legible: Part 3

Since processing the text documents, I've been refining the goal of "finding latent (content and contextual) relationships within a large corpus of texts". As the text remains a work in progress, I want to focus on how it has evolved and continues to evolve. A genealogical approach to text relationships can be used to identify which pieces have been disregarded or ignored (and thus require further inspection), and to identify the dominant tendencies and trains of thought.

An interesting writing tool for collaboration and version control: http://docs.withdraft.com

Beyond looking at the past, I think this project can provide a foundation for developing a writing tool that moves beyond version control or collaborative commenting. Version control tends to provide a fine-grained, binary approach: it compares two things and extracts the insertions or deletions. While this is helpful in an isolated scenario, I'm interested in broader developments across multiple objects over many time periods. Alternatively, version control also provides a high-level view indicating change-points over a long time, but those points of change are overly simplified, often represented by just a single dot. Without context, or without knowing at what specific time a change was made, this larger overview provides little information beyond the quantity and frequency of changes. Through a genealogical and contextual approach to analyzing an existing body of text, I'm hoping to identify what sorts of relationships could inform the writing and editing process.

With all the data now added to the database, I’ve been exploring sentence similarity. The diagram below shows the process I’ve gone through up to this point.

Once I've computed a two-dimensional array mapping the similarity of all sentences to each other (sketched below), I plan on using that information to create a visual interface for exploring those relationships. The wireframes below are a rough sketch of what form this might take.
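As a rough illustration of that computation (not the project's actual code), the all-pairs array could be built like this, assuming `sentences` is an array of content strings and `similarity` is a pairwise scorer such as the Dice coefficient:

```js
// A minimal sketch of the planned all-pairs similarity array. Note that
// this O(n²) approach is exactly what later ran into memory and time
// limits (see Part 4 above).
function similarityMatrix(sentences, similarity) {
  const n = sentences.length;
  const matrix = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    matrix[i][i] = 1; // a sentence is identical to itself
    for (let j = i + 1; j < n; j++) {
      const score = similarity(sentences[i], sentences[j]);
      matrix[i][j] = score;
      matrix[j][i] = score; // similarity is symmetric
    }
  }
  return matrix;
}
```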

Making ‘Making Legible’ Legible: Part 2

The structure of data has profound consequences for the design of algorithms.
– “Beyond the Mirror World: Privacy and the Representational Practices of Computing”, Philip E. Agre

To atomize the entire corpus of text, the server processes each document upload to create derivative objects: an upload object, a document object, sentence objects, and word objects. By disassociating the constituent parts of the document, they can then be analyzed and form relationships outside that original container. I'll discuss those methods of analysis in a later blog post. The focus of this post is how the text is atomized and stored because, as Agre pointed out, the organization of data fundamentally underpins the possibility of subsequent analysis.

The individual objects are constructed through a series of callback functions which assign properties. These functions alternate between creating an object with its individual or inherited properties (e.g. initializing a document object with a unique ID, shared timestamp, and content string) and updating that object with its relational properties (e.g. an array of the IDs of all words contained within a document). By necessity, some of these properties can only be added once other objects have been processed. The spreadsheet below shows the list of properties and how they are derived.

Properties for each object type
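To make the pattern concrete, here is a simplified sketch of the two-phase construction, with hypothetical function and field names (the actual callback chain is more involved):

```js
// First pass: create the object with its individual and inherited
// properties only. The relational properties start empty.
let nextId = 0;
const generateId = (prefix) => `${prefix}-${nextId++}`; // hypothetical helper

function createDocument(upload, callback) {
  const doc = {
    _id: generateId('doc'),       // unique ID
    uploadId: upload._id,         // inherited from the upload object
    timestamp: upload.timestamp,  // shared timestamp
    content: upload.content,
    wordIds: []                   // relational, filled in by the second pass
  };
  callback(null, doc);
}

// Second pass: once the word objects exist, update the document with
// the relational property linking it to them.
function addWordRelations(doc, words, callback) {
  doc.wordIds = words.map((w) => w._id);
  callback(null, doc);
}
```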

Additionally, as discussed in the previous post, adjacency (or context) is a significant relationship. After the words or sentences are initialized with their unique IDs, the callback function iterates over them again to add a property for the ID of each adjacent object.
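A minimal sketch of that second pass, with assumed field names:

```js
// Iterate over the initialized sentences a second time, storing each
// neighbour's ID so that adjacency survives outside the document.
function linkAdjacent(sentences) {
  for (let i = 0; i < sentences.length; i++) {
    sentences[i].previousSentenceId = i > 0 ? sentences[i - 1]._id : null;
    sentences[i].nextSentenceId =
      i < sentences.length - 1 ? sentences[i + 1]._id : null;
  }
  return sentences;
}
```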

At the sentence level, because the original documents were written in markdown, special characters had to be identified, stored as properties, and then stripped from the string. While the "meaning" and usage of these characters is not consistent over time or across documents, they can later be used to identify and extract chunks from a document.
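As a rough sketch of that step (the character set and field names here are assumptions, not the project's actual handling):

```js
// Record any markdown characters as a property, then strip them from
// the sentence's content string.
const MARKDOWN_CHARS = /[#*_`>~\[\]]/g; // an assumed, partial character set

function stripMarkdown(sentence) {
  const found = sentence.content.match(MARKDOWN_CHARS) || [];
  return {
    ...sentence,
    markdownChars: found,                               // stored as a property
    content: sentence.content.replace(MARKDOWN_CHARS, '').trim()
  };
}
```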

Below is an example excerpt of a processed output, from which the individual objects are added to the database. The full code for processing the document upload can be found here.