It Keeps Thinking It’s A Truck (053)

It Keeps Thinking It’s A Truck is a stream of images from Times Square taken at one-minute intervals. Using the YOLO (You Only Look Once) machine learning model, objects within each image are localized, identified, and classified, for better or worse.
https://vimeo.com/260324290
As the images scroll by, errors in the predictions become evident. The unchanging ticket booth is periodically identified as a truck, fences are benches, and when seated, people are sometimes classified as fire hydrants. Can we see how a statistical model “sees”, that is, how it can see a truck where there is actually a booth? How will the built environment change to accommodate computer vision?
People are magenta, trucks are green, cars are yellow.
Technicals
The Times Square dataset was collected over the course of a week from a live streaming camera, as detailed in the Routine Grid post. Objects within each image are identified using YOLO, a machine learning model for object detection. For each image, YOLO produces a corresponding image with bounding boxes and labels for the identified objects. An explanation of YOLO is given in “Notes to Self on Using Yolo”.
A bash script is executed to perform detection on an entire folder of images. However, the script is inefficient, as it reloads the model for each individual image.
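The loop itself is short; below is a rough sketch of it in Node.js rather than bash, assuming a darknet-style detect command and invented folder names, mainly to show where the inefficiency lives: every iteration launches the detector, which reloads the weights from scratch.

    // rough Node.js rendering of the per-image loop (the original is a bash script)
    // "./darknet detect ..." is an assumed darknet-style CLI; paths will vary
    const { execFileSync } = require("child_process");
    const fs = require("fs");
    const path = require("path");

    const inputDir = "times-square";   // hypothetical folder of captured frames
    const outputDir = "detections";
    fs.mkdirSync(outputDir, { recursive: true });

    for (const file of fs.readdirSync(inputDir).filter((f) => f.endsWith(".jpg"))) {
      // every call pays the full cost of loading the network weights
      execFileSync("./darknet", [
        "detect",
        "cfg/yolo.cfg",                // assumed config file
        "yolo.weights",                // assumed weights file
        path.join(inputDir, file),
      ]);
      // darknet writes its annotated result to a fixed file (predictions.jpg
      // in many builds); keep a copy per input image
      fs.renameSync("predictions.jpg", path.join(outputDir, file));
    }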
Next Steps
  • Rather than output an image, save the predictions and bounding box coordinates to a text file for dynamic use elsewhere, such as on the web.
  • Cut out bounding boxes from the image rather than overlaying them on the original image.

Notes to Self on Using Yolo

YOLO (You Only Look Once) applies a single neural network to an entire image.
The network divides the image into a grid of 13×13 cells, and five bounding boxes are predicted for each cell. Consequently, there are potentially 845 (13 × 13 × 5) separate bounding boxes.
For each predicted bounding box, a confidence score indicates how certain the network is that the box actually encloses some object, and a class is predicted, such as bicycle, person, or dog.
The confidence score and class prediction are combined into a final probability that the bounding box contains a particular class.
A threshold can be set for the confidence score, which is 0.25 by default. Scores lower than this will not be kept in the final prediction.
YOLO is distinctive because predictions are made with a single network evaluation rather than many incremental regional evaluations; as the name suggests, it looks at the image only once, instead of sliding a small window across the image and classifying many times.
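As a rough illustration of the scoring described above, the snippet below combines each box’s confidence with its class probabilities and applies the 0.25 threshold; the boxes, class list, and numbers are made up. A full YOLO pipeline also runs non-maximum suppression to merge overlapping boxes, which is omitted here.

    // sketch of YOLO-style post-processing: combine each box's confidence
    // with its class probabilities, then drop anything under the threshold
    const THRESHOLD = 0.25;
    const CLASSES = ["person", "car", "truck"];   // invented class list

    const boxes = [
      { x: 0.2, y: 0.3, w: 0.1, h: 0.4, confidence: 0.8, classProbs: [0.7, 0.2, 0.1] },
      { x: 0.6, y: 0.5, w: 0.3, h: 0.2, confidence: 0.3, classProbs: [0.1, 0.3, 0.6] },
    ];

    const detections = boxes
      .map((box) => {
        // final score = P(object) * P(class | object)
        const scores = box.classProbs.map((p) => p * box.confidence);
        const best = scores.indexOf(Math.max(...scores));
        return { ...box, label: CLASSES[best], score: scores[best] };
      })
      .filter((box) => box.score > THRESHOLD);

    console.log(detections);   // only the first box survives: "person" at 0.56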

Here I Am — But Not Really (052)

Here I Am — But Not Really composites a figure in real time onto the geolocated Google Streetview image corresponding to that person’s location.

The map builds on previous location-based and context-questioning explorations, such as “The Other” or “This Is What I See”. In playing with the scale of the digital body, new interactions with the scene emerge: knocking on windows, climbing trees. The digital body is cut out from its own context, allowing it to engage with the objects in the two-dimensional street images, such as using a chair to climb a set of stairs.

Technicals

The map is built around Kinectron, an application for the peer-to-peer broadcasting of Kinect data. The user visits a website that broadcasts the RGB image feed of their Kinect along with their location data. When other users connect, the location data is used to generate the corresponding Streetview image, and the live Kinect feed is composited on top of it.
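A minimal p5.js sketch of the receiving end, loosely following the Kinectron client examples; the server address, coordinates, Street View API key, and the shape of the incoming frame are all placeholders, and the exact client API may differ between Kinectron versions.

    // sketch of the viewer: draw the broadcaster's Streetview, then the incoming
    // Kinect color frames on top (Kinectron client library assumed loaded)
    let kinectron;
    let streetview;

    function setup() {
      createCanvas(640, 480);

      // hypothetical coordinates received from the broadcasting user
      const lat = 40.758, lng = -73.9855;
      streetview = loadImage(
        "https://maps.googleapis.com/maps/api/streetview" +
          `?size=640x480&location=${lat},${lng}&key=YOUR_KEY`   // placeholder key
      );

      kinectron = new Kinectron("127.0.0.1");   // address of the Kinectron server
      kinectron.makeConnection();
      kinectron.startColor(gotFrame);           // request the RGB color feed
    }

    function gotFrame(frame) {
      // in the Kinectron examples the frame arrives with a data-URL src
      loadImage(frame.src, (img) => {
        image(streetview, 0, 0, width, height);
        image(img, 0, 0, width, height);
      });
    }

The piece itself cuts the figure out of its surroundings before compositing, presumably using a keyed (background-removed) feed rather than the raw color image drawn here.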

Next Steps
  • Explore compositing multiple users into a single scene.

I See What You See — But Not Really (051)

When you share a location, why is it represented as a point on a map?

I See What You See — But Not Really (051) explores how we identify place from an image. When two people are present on the website, each sees the Google Streetview corresponding to the other’s location. It’s like FaceTime or Google Maps, but represents your location as an image instead of your face or a point on a map.

How is seeing an image of a place different from a point on a map or a live video stream? How does our understanding of place change when seen only through an image?

Only the Streetview is shown, creating an ambiguity in what is actually seen by the person on the other end. Is it a recent photo? Does the time of day correspond? Is this what they are looking at or are they inside of a building? When you see someone else’s location, are you aware that your location is also being shared?

The language on the site shifts from that of a third party (“Give yourself a name / Who are you looking for? / Waiting for Patrick.”) to that of the person on the other end (“Here I am!”).

Technicals

Sharing location data and checking whether both people have connected happens server-side, using an Express application. After a user creates a name and indicates who they’re looking for, their location coordinates are sent to the server automatically, using the browser’s navigator.geolocation.watchPosition method. Until the other user connects, the client polls the server every second to check whether the second user has joined. Once both users are connected, they receive each other’s location data, which is fed into the Google Streetview API to show the corresponding image.
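A condensed sketch of that client-side flow; the endpoint paths, response fields, and the #view image element are invented for illustration.

    // sketch of the client: stream coordinates to the server, poll until the
    // other user joins, then request a Streetview image for their location
    const myName = "Patrick";          // taken from the form in the real app
    const lookingFor = "Lisa";

    navigator.geolocation.watchPosition((position) => {
      fetch("/location", {             // hypothetical Express endpoint
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          name: myName,
          lookingFor,
          lat: position.coords.latitude,
          lng: position.coords.longitude,
        }),
      });
    });

    const poll = setInterval(async () => {
      const res = await fetch(`/partner?name=${myName}`);   // hypothetical endpoint
      const partner = await res.json();
      if (!partner.connected) return;  // keep waiting
      clearInterval(poll);

      // Street View Static API; a real key is needed for this to work
      document.querySelector("#view").src =
        "https://maps.googleapis.com/maps/api/streetview" +
        `?size=600x400&location=${partner.lat},${partner.lng}&key=YOUR_KEY`;
    }, 1000);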

Next Steps
  • Fix styling of the form: replace default fonts and sizes, especially on mobile.
  • Consider showing date and time of Streetview image capture.
  • Make a recording in which one person is changing locations / walking around.
  • Capture a “context” view of the current street-level conditions (for documentation purposes).

Hello Hello Hello (050)

Hello Hello Hello illustrates the difference in writing Hello based on a few of the previous instruction sets: walking city blocks, moving one’s hand through space, tilting a phone, and a software algorithm.

The different contexts, technologies, and relationships to the body are made evident in the juxtaposed drawings. The length of each video is kept relative to the others, comparing the instantaneousness (or stasis) of the algorithmic method against the slowness of walking. Additionally, within each quadrant, different executions of the same process play on a loop.

Technicals

The implementations for each Hello are detailed in the following blog posts: 040, 041, 042.

Each Hello is composited on the web using p5.js. Although the handwritten Hello was captured as a video, its frames are read individually to show only the pixels associated with the cyan line.
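A small p5.js sketch of that per-frame filtering, assuming a hypothetical hello.mp4 capture and a crude definition of “cyan”; the actual compositing almost certainly tunes these thresholds differently.

    // read the handwriting video frame by frame and keep only the pixels that
    // look cyan, so the drawn line shows without its original background
    let video;

    function setup() {
      createCanvas(640, 360);
      video = createVideo("hello.mp4", () => video.loop());   // placeholder file
      video.size(640, 360);   // match the canvas so the pixel arrays line up
      video.hide();           // draw the filtered frame manually instead
    }

    function draw() {
      clear();
      video.loadPixels();
      if (video.pixels.length === 0) return;   // frame not ready yet

      const frame = createImage(video.width, video.height);
      frame.loadPixels();
      for (let i = 0; i < video.pixels.length; i += 4) {
        const r = video.pixels[i];
        const g = video.pixels[i + 1];
        const b = video.pixels[i + 2];
        // rough test for the cyan line: low red, high green and blue
        const isCyan = r < 100 && g > 150 && b > 150;
        frame.pixels[i] = r;
        frame.pixels[i + 1] = g;
        frame.pixels[i + 2] = b;
        frame.pixels[i + 3] = isCyan ? 255 : 0;   // transparent everywhere else
      }
      frame.updatePixels();
      image(frame, 0, 0, width, height);
    }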

Next Steps
  • Use recorded coordinate data to produce each Hello rather than a scattershot of different approaches.