Making ‘Making Legible’ Legible: Part 4

As I’ve discussed in previous posts, this project attempts to find relationships of dominant tendencies and abandoned nodes within a large body of text. The collection of text evolved in structure over time, so examining it from a purely document-based approach is not appropriate.

If the existing boundaries of a body of text are conventionally documents, this project instead treats sentences as the primary object and attempts to draw new boundaries around sentences in various documents.

Much of my effort for this project focused on how to create the structure for these relationships to come about.

To dismantle the existing boundaries and draw new ones is fundamentally a question of how the text is organized in the database: what properties do objects need to have, and how can those properties be used?

Before processing the body of text, I crafted a spreadsheet to track which properties were inherited or unique, and whether they were content- or context-related. For example, context included the IDs of adjacent sentences, whether the sentence was part of a duplicate document, and its relative position within a document.
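As a rough sketch (the property names here are illustrative, not the project’s actual schema), a sentence object combining inherited, unique, content, and context properties might look like:

```javascript
// Hypothetical sketch of a sentence object. Property names are my own
// illustration of the inherited / unique / content / context split,
// not the actual schema from the spreadsheet.
function makeSentence(doc, text, index, prevId, nextId) {
  return {
    id: `${doc.id}-s${index}`,            // unique: this sentence's ID
    docId: doc.id,                        // inherited: parent document
    timestamp: doc.timestamp,             // inherited: when the document was written
    text,                                 // content: the sentence string
    position: index / doc.sentenceCount,  // context: relative position in the document
    prevId,                               // context: ID of the preceding sentence
    nextId,                               // context: ID of the following sentence
    isDuplicateDoc: doc.isDuplicate       // context: part of a duplicate document?
  };
}

const doc = { id: 'd42', timestamp: '2016-03-01', sentenceCount: 10, isDuplicate: false };
const s = makeSentence(doc, 'A first sentence.', 0, null, 'd42-s1');
```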

Once all the text had been atomized into the database collections, my focus shifted to how to compare the similarity of sentences. Similarity between sentences across time is the building block for identifying the dominant tendencies within the text.

After running into memory and time problems while attempting to compare every sentence to every other sentence, I moved to a comparison method in line with my ambitions for representation. Sentences were grouped into rows, with each row representing a single date. The sentences in one row are compared only to those in the following four rows. This limits the number of comparisons while still recognizing that an appropriate match may not be in the immediately adjacent time period.
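A minimal sketch of that windowing scheme, assuming rows are already grouped by date and `similarity` is a placeholder scoring function:

```javascript
// Sketch of the windowed comparison: each row holds the sentences for one
// date, and a row is compared only against the next WINDOW rows.
// `similarity` is a placeholder; the actual metric is a separate question.
const WINDOW = 4;

function compareRows(rows, similarity) {
  const results = [];
  for (let i = 0; i < rows.length; i++) {
    // Only look ahead up to WINDOW rows, never past the end.
    for (let j = i + 1; j <= Math.min(i + WINDOW, rows.length - 1); j++) {
      for (const a of rows[i].sentences) {
        for (const b of rows[j].sentences) {
          results.push({ a: a.id, b: b.id, score: similarity(a.text, b.text) });
        }
      }
    }
  }
  return results;
}
```

Each result object (a pair of sentence IDs plus a score) corresponds to one entry in the separate comparison collection.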

This comparison data was stored in a separate collection from the sentence objects themselves.

Difficulty with string comparisons….

  

Making ‘Making Legible’ Legible: Part 3

Since processing the text documents, I’ve been refining the goal of “finding latent (content and contextual) relationships within a large corpus of texts”. As the text remains a work in progress, I want to focus on how it has evolved and continues to evolve. A genealogical approach to text-relationships can be used to identify what pieces have been disregarded or ignored (and thus require further inspection) or identify the dominant tendencies and trains of thought.

An interesting writing tool for collaboration and version control: http://docs.withdraft.com

Beyond looking at the past, I think this project can provide a foundation for developing a writing tool that moves beyond version control or collaborative commenting. Version control tends to provide a fine-grained, binary approach: it compares two things and extracts the insertions or deletions. While this is helpful in an isolated scenario, I’m interested in broader developments across multiple objects over many time periods. Alternatively, version control also provides a high-level view indicating change-points over a long time, but those points of change are overly simplified – often represented by just a single dot. Without context, or without knowing at what specific time a change was made, this larger overview provides little information beyond the quantity and frequency of changes. Through a genealogical and contextual approach to analyzing an existing body of text, I’m hoping to identify what sort of relationships could inform the writing and editing process.

With all the data now added to the database, I’ve been exploring sentence similarity. The diagram below shows the process I’ve gone through up to this point.
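As one simple stand-in for a similarity measure (I haven’t committed to a particular metric here), Jaccard similarity over the sets of words in two sentences:

```javascript
// Jaccard similarity over lowercase word sets: |A ∩ B| / |A ∪ B|.
// This is just one candidate metric, used here for illustration.
function jaccard(a, b) {
  const setA = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const intersection = [...setA].filter(w => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}
```

Whatever the metric, its pairwise scores are what fill the two-dimensional array described below.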

Once I’ve computed a two-dimensional array mapping the similarity of all sentences to each other, I plan on using that information to create a visual interface for exploring those relationships. The wireframes below are a rough sketch of what form this might take.

Making ‘Making Legible’ Legible: Part 2

The structure of data has profound consequences for the design of algorithms.
– “Beyond the Mirror World: Privacy and the Representational Practices of Computing”, Philip E. Agre

To atomize the entire corpus of text, the server processes each document upload to create derivative objects: an upload object, a document object, sentence objects, and word objects. By disassociating the constituent parts of the document, they can then be analyzed and form relationships outside that original container. I’ll discuss those methods of analysis in a later blog post. The focus of this post is how the text is atomized and stored because, as Agre pointed out, the organization of data fundamentally underpins the possibility of subsequent analysis.
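A hypothetical sketch of that atomization step; the naming scheme and the sentence splitting here are simplified assumptions, not the actual processing code:

```javascript
// Sketch: break one upload into an upload object, a document object,
// sentence objects, and word objects. Splitting on sentence-ending
// punctuation is a simplification for illustration.
function atomize(uploadId, timestamp, text) {
  const upload = { id: uploadId, timestamp };
  const document = { id: `${uploadId}-doc`, uploadId, timestamp, content: text };
  const sentences = text.split(/(?<=[.!?])\s+/).map((s, i) => ({
    id: `${document.id}-s${i}`, docId: document.id, text: s
  }));
  const words = sentences.flatMap(s =>
    s.text.split(/\W+/).filter(Boolean).map((w, i) => ({
      id: `${s.id}-w${i}`, sentenceId: s.id, text: w
    }))
  );
  return { upload, document, sentences, words };
}
```

Each of the four object types would then be written to its own database collection.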

The individual objects are constructed through a series of callback functions that assign properties. These functions alternate between creating an object with its individual or inherited properties (e.g. initializing a document object with a unique ID, shared timestamp, and content string) and updating that object with relational properties (e.g. an array of the IDs of all words contained within a document). By necessity, some of these properties can only be added once other objects are processed. The spreadsheet below shows the list of properties and how they are derived.

Properties for each object type

Additionally, as discussed in the previous post, the question of adjacency (or context) is a significant relationship. After the words or sentences are initialized with their unique IDs, the callback function then iterates over them again to add a property for the ID of the adjacent object.
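A sketch of that second pass, assuming the objects arrive in document order:

```javascript
// Second pass: once each object has its unique ID, iterate over the
// array again to record the IDs of its neighbours.
function linkAdjacent(objects) {
  return objects.map((o, i) => ({
    ...o,
    prevId: i > 0 ? objects[i - 1].id : null,
    nextId: i < objects.length - 1 ? objects[i + 1].id : null
  }));
}
```

This has to run after initialization because a sentence’s neighbours don’t have IDs until the whole document has been processed.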

At the sentence level, because the original documents were written in markdown, special characters had to be identified, stored as properties, and then stripped from the string. While the “meaning” and usage of these characters is not consistent over time or across documents, they can later be used to identify and extract chunks from a document.
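A sketch of that extraction step; the set of characters treated as markdown markers here is my own guess, not the project’s actual list:

```javascript
// Pull markdown marker characters out of a raw sentence, keep them as a
// property, and strip them from the stored string. The character set is
// an assumption about what counts as "special".
const MARKDOWN_CHARS = /[#*_>`\[\]()-]/g;

function extractMarkers(raw) {
  const markers = raw.match(MARKDOWN_CHARS) || [];
  const text = raw.replace(MARKDOWN_CHARS, '').replace(/\s+/g, ' ').trim();
  return { markers, text };
}
```

Keeping the markers as a property preserves the option of reassembling or chunking documents later, even though the cleaned string is what gets compared.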

Below is an example excerpt of a processed output, from which the individual objects are added to the database. The full code for processing the document upload can be found here.

On Clay Shirky’s ‘Here Comes Everybody’

Some thoughts:

Through the lens of social media, Shirky illustrates McLuhan’s initial proposition that the message of a given technology is the resulting change in human relationships over space and time (the psychic and social consequences). However, he points to ‘professional narcissism’ as the reason for newspapers’ obliviousness to the effect of social media. I’d argue that it’s not a question of professional bias as to why traditional publishing misjudged the role of social media and amateur publishing, but rather an inability for anyone to foresee something that does not yet exist. Hindsight is valuable precisely because we can look at a time and space we are no longer in. Sidenote: he omits architects as professionals, but I’d agree they are most representative of this quotation: “[a professional] pays as much or more attention to the judgment of her peers as to the judgment of her customers when figuring out how to do her job.” (Bolding mine, recovering from architecture forever.)

I also found the briefly touched-on question of physicality interesting. Shirky writes, “Digital means of distributing words and images have robbed newspapers of the coherence they formerly had, revealing the physical object of the newspaper as a merely provisional solution; now every article is its own section.” Thinking back again to McLuhan, who argued that the typographic cultural bias equated linearity with rationality, perhaps this too is reflected in the newspaper’s inability to forecast the impact of social media.

On Heidegger’s ‘The Question Concerning Technology’

Heidegger argues that the essence of technology is an orientation with the natural world, called “enframing”, through which the “real” is revealed in a certain way. Enframing challenges and orders nature as a “standing reserve” of raw material, but is also dangerous by obscuring all other ways of revealing with its seemingly “destined” order. Heidegger concludes by arguing that art is a saving power within enframing. Art is an alternative mode of revealing the “real”, where nature is brought-forth through reflection, rather than being challenged or set-upon.

PDF of slides from in-class discussion