TANDEM in toto

TANDEM: A Web-Based Text and Image Data Generator

Kelly Blanchat, Jojo Karlin,

Stephen Real, Christopher Vitale

DH Praxis

Spring 2015

ABSTRACT

TANDEM is a Python-based Django web application that generates text and image data from files submitted by the user. TANDEM is for scholars seeking quantitative insight into a corpus consisting of picture books, comics, advertisements, and other images with overlaid text. The TANDEM application combines three existing open source technologies: Tesseract OCR, Open Source Computer Vision (OpenCV), and a natural language processing library called Natural Language Toolkit (NLTK). The output is assembled into a summary-level .csv document as well as multiple page-level detail documents. TANDEM is the result of the 2014-15 Digital Humanities Praxis course at the CUNY Graduate Center. Team members include founder and project manager Christopher Vitale, developer Stephen Real, UI/UX designer Kelly Blanchat, and outreach coordinator Jojo Karlin.

WHAT TANDEM DOES

TANDEM, a Python-based Django web application, makes data generation more accessible for scholars interested in examining properties of content that contains both words and pictures. TANDEM brings together two processes in one place: Natural Language Processing (Tesseract OCR and NLTK) and Computer Vision (OpenCV). The user uploads files (.jpg, .gif, .tiff, and .pdf) for analysis through a straightforward interface. Once the user submits files, TANDEM uses Tesseract OCR, an open source optical character recognition engine, to convert images to text by analyzing the shapes and curves formed by pixels. Those curves are matched to letter shapes, and the recognized text is logged in a .txt file. TANDEM then uses NLTK (Natural Language Toolkit), a free, open source suite of libraries for statistical natural language processing in Python. In TANDEM, NLTK runs a series of basic queries including word count, unique word count, word length, and average word length. While these text processes run, TANDEM analyzes the visual image data with OpenCV (Open Source Computer Vision). OpenCV queries determine image size, image color, a single average color for the entire page using red, green, and blue color depths (RGB), and a standard deviation for those RGB values.
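The kinds of measurements described above can be illustrated with a minimal sketch. It is not TANDEM's code: TANDEM relies on NLTK and OpenCV, whereas this example assumes a simplified regular-expression tokenizer and a plain list of (R, G, B) tuples so that it runs with the Python standard library alone. The function names `text_stats` and `color_stats` are hypothetical.

```python
import re
from statistics import mean, pstdev


def text_stats(text):
    """Basic word-level counts for a page of recognized text."""
    # Simplified stand-in tokenizer (an assumption): runs of letters
    # and apostrophes, lowercased so "The" and "the" count as one word.
    words = re.findall(r"[a-z']+", text.lower())
    lengths = [len(word) for word in words]
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "average_word_length": mean(lengths) if lengths else 0.0,
    }


def color_stats(pixels):
    """Per-channel mean and standard deviation for (R, G, B) tuples."""
    # zip(*pixels) regroups the pixels into three sequences:
    # all red values, all green values, all blue values.
    channels = list(zip(*pixels))
    return {
        "mean_rgb": tuple(mean(channel) for channel in channels),
        "std_rgb": tuple(pstdev(channel) for channel in channels),
    }
```

A page's summary row could then be assembled by merging the two dictionaries, for example `{**text_stats(page_text), **color_stats(page_pixels)}`, where `page_text` and `page_pixels` are placeholders for the OCR output and the decoded image.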

USE CASES

The output produced by TANDEM can be used as a source for data visualization, quantitative analysis, and distant reading of multimodal print objects. TANDEM is ideal for the distant reading of any content that contains a rich mix of text and images. A classic example of such material is children’s picture books. Advertisements and posters, which combine text and image to catch readers’ attention, are another rich source of material for statistical analysis, as is website design, which requires a balanced combination of text and image. Overall, because the TANDEM data output can generate statistics for both the text and image values, TANDEM is a useful tool for anyone interested in researching the patterns and relationships of these kinds of combinations.

A scholar of children’s literature might want to compare a corpus of one author to that of another, or analyze versions of a classic children’s text, like Mother Goose, over time. A person studying trends in print advertising could use TANDEM to identify patterns in the relationship between text and image that shift over time based on changing aesthetics, or they could identify statistical differences in advertising by product, industry, or periodical.

TANDEM WIREFRAMING

TANDEM provides a tool that, at its core, is designed for ease of use so that interested scholars can focus on analysis and not the time-consuming process of figuring out how to install and run complex computer software. Image 1 outlines the three processes that occur behind the scenes of the user interface in TANDEM.

Image 1: TANDEM wireframe

TANDEM simplifies the process on the user side. After navigating to dhtandem.com, the user simply follows the prompts to upload a source corpus of image files. Once the corpus has been uploaded to TANDEM, the files are analyzed, after which the user can download a zip file, composed of a main .csv file and page-specific data files containing TANDEM statistics, to a local folder.
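The shape of such a download can be sketched as follows. This is an illustration under stated assumptions, not TANDEM's implementation: the file names (`summary.csv`, `pages/<page>.csv`), the column names, and the function `build_results_zip` are all hypothetical, and each page is assumed to arrive as a dictionary with a `"page"` key.

```python
import csv
import io
import zipfile


def build_results_zip(pages):
    """Return zip bytes: one summary CSV plus one detail CSV per page.

    `pages` is a non-empty list of dicts that share the same keys,
    one of which must be "page" (used to name the detail file).
    """
    if not pages:
        raise ValueError("at least one page is required")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Summary file: a header row, then one row per page.
        summary = io.StringIO()
        writer = csv.DictWriter(summary, fieldnames=list(pages[0].keys()))
        writer.writeheader()
        writer.writerows(pages)
        archive.writestr("summary.csv", summary.getvalue())
        # Detail files (hypothetical layout): one key,value row per field.
        for row in pages:
            detail = io.StringIO()
            csv.writer(detail).writerows(row.items())
            archive.writestr(f"pages/{row['page']}.csv", detail.getvalue())
    return buffer.getvalue()
```

Building the archive in memory keeps the sketch free of temporary files; a web application could return the resulting bytes as the body of a download response.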

DEVELOPMENT AND EVOLUTION OF THE SCRIPT

The process of developing TANDEM began with creating the development environment. The developer installed software onto a personal laptop, where Python and an integrated development environment (IDE) called PyCharm had previously been installed. The first major roadblock involved the installation of the core packages required for TANDEM’s functionality (Feature Extractor, Tesseract OCR, and NLTK), which the project manager had researched in advance. Package installation on the command line was challenging because even when the installation seemed to work, the Python “import” would fail. This problem was compounded by multiple versions of Python running on the developer’s workstation and a lack of real understanding of how they worked together. With support from Digital Fellows at the CUNY Graduate Center, all necessary packages were installed with the exception of Feature Extractor, which was dependent on MATLAB, a proprietary software package. Based on advice from a meeting with Lev Manovich, a cultural analytics researcher and professor at the CUNY Graduate Center, the team made a strategic decision to switch to OpenCV to handle the image data.

At this point, a very simple prototype was developed in two phases. First, the developer explored the NLTK queries and the project manager investigated the OpenCV image processing. Independently, their scripts were able to achieve a minimal amount of processing. The scripts were then unified, providing a starting point for the TANDEM script that exists in the current iteration of the program. This Python script could process .jpg files from a local hard drive to produce simple data outputs to verify that both NLTK and OpenCV were working. With this working, another challenge emerged: PDF files were not supported by Tesseract. To work around this issue, PDFMiner was added. Additionally, the NLTK code had bugs related to the mishandling of inputs and outputs, confusing byte strings with unicode strings. Despite these difficulties, progress was strong and steady.
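The source does not describe how the string and unicode bugs were fixed. One common remedy for this class of problem is to normalize every input to text at the boundary, before any tokenizing happens. The helper below is a minimal sketch of that idea; the name `to_text` and the UTF-8 default are assumptions, not details from the project.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding bytes if necessary."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes,
        # so one stray byte in OCR output does not abort processing.
        return value.decode(encoding, errors="replace")
    return str(value)
```

Calling such a helper once, where OCR or PDF output enters the program, means the rest of the pipeline only ever sees one string type.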

GitHub, a web-based version control environment, was employed for TANDEM code sharing. Throughout the project, versions of TANDEM were pushed to GitHub by the developer so that other team members could retrieve the code. Other team members faced similar challenges with building test environments, but validating the process proved an invaluable exercise to ensure that the application would work on other platforms.

Strategically, the full back-end of the application was built while design and architecture were conceptualized for the front-end. While work continued on building a gradually more robust data engine, the team sought advice on which of the many Python web frameworks to use. Initially, Flask was selected because research had suggested that it presented a lower technology barrier. Though the trivial prototype ran on the Reclaim Hosting Server, there was minimal support available for Flask. At the recommendation of professors Luke Waltzer and Amanda Hickman, the TANDEM application was switched to Django, a more common and previously evaluated framework with more local support available.

At this point there were two parallel programming tasks: to learn Django to build a front-end for the code, and to recreate the necessary Python environment on the server. The goal was to run the entire TANDEM “engine” from the Command Line on the server. However, building that environment was just as difficult as it had been previously. Though a second attempt eliminated some obstacles, server-level permissions issues added complexity.

Thanks in part to Django documentation and tutorials, the process of building a basic prototype went smoothly. At that time, the team decided to persist user data, which meant that SQLite database functionality needed to be added while the front-end was built. Django automatically provides this functionality, making it easy to provide the persistence. The key challenge at this phase was the uploading and downloading of files. Solutions to these problems were identified through online resources such as GitHub and Stack Overflow, a community question-and-answer site for programming.

Finally, it was time to integrate the user interface including styled templates, CSS and images. This integration worked well once it was determined where Django projects store these files and how to point the Django templates at the correct files containing the elements users would see and interact with.

The final end-to-end testing revealed the most tenacious bug, and the symptoms were mixed. At times, testers would report that downloaded results contained data from files that they did not supply. At other times, reports indicated that the download was empty or that unzipping the file generated an error message. Other times the application would simply crash mid-processing. The real problem was determined to be a result of the stateless nature of HTTP. The program creates a set of folders for the user’s project at the start of the process. After the user submits files, the program needs to write them to the correct folders to associate them with the user’s project. At this stage, the user instances and their relation to the folders were not being preserved. As a result, every time a user clicked a button, there was no certainty that the application would connect the request to that user’s project. This problem was identified and resolved using a single line of code built into the Django framework, proving again the value of using it to build TANDEM.
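The source does not say which Django feature resolved the problem. The general idea of tying each request back to one project's folders can be sketched with an opaque token that is created once and presented again on every later request (in a Django application, the session framework is one mechanism that can carry such a token between requests). The function names and the `uploads` subfolder below are hypothetical.

```python
import os
import tempfile
import uuid


def create_project(base_dir):
    """Create a uniquely named project folder and return its token."""
    token = uuid.uuid4().hex  # 32 lowercase hex characters
    os.makedirs(os.path.join(base_dir, token, "uploads"))
    return token


def project_path(base_dir, token):
    """Resolve the folder for a token presented by a later request."""
    # Accept only well-formed tokens so a crafted value such as
    # "../other" can never escape base_dir.
    well_formed = len(token) == 32 and all(c in "0123456789abcdef" for c in token)
    path = os.path.join(base_dir, token)
    if not well_formed or not os.path.isdir(path):
        raise KeyError(f"unknown project: {token}")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        first = create_project(base)
        second = create_project(base)
        # Two users get two distinct folders, so their files cannot mix.
        print(project_path(base, first) != project_path(base, second))
```

Because every upload and download looks up its folder through the token, two users working at the same time can no longer read or overwrite each other's files, which matches the symptoms described above.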

USER INTERFACE (UI) & USER EXPERIENCE (UX)

Because TANDEM breaks down technical barriers, allowing for a larger user base of scholarly inquirers, it was important that the user interface reflect those same principles. In conceiving of the look and feel of the application, it was important that the application use high-contrast design elements to facilitate straightforward functionality (Image 2).

Image 2: TANDEM stylesheet by Kelly Blanchat

Before conceptualizing user navigation it was important to ask a few essential questions, such as:

What will users need to know when arriving at dhtandem.com?
How many buttons will be needed for the users to upload their files?
As the users’ file(s) process, what information do they need to know about what is happening in the backend?
How will the users know when the process is complete?
Where and how will the data output download be delivered?
What will the application do post-processing?

The essential concern for the user interface (UI) and user experience (UX) was to ensure that the person on the other side understands what is happening at all times. For example, if the user’s corpus is large, the upload time may be lengthy, so ideally the UI would provide a visible indication of the progress. Additionally, if errors occur, best practices should be clearly stated so that the user can avoid or correct them.

To begin to answer some of these questions and concerns, the UI/UX designer reviewed other applications and websites that prompt the user to input and define information. The review included social media websites, such as Facebook and Twitter, as well as other web tools that accept .csv file formats, such as Serials Solutions from ProQuest. The most user-friendly processes did not require detailed documentation or conditional functionalities and could be ascertained by visual cues. With this in mind, the UI/UX designer decided that TANDEM should include concise, even friendly, language, with no jargon and minimal text on the application’s processing pages (see: Image 3).

Image 3: The current TANDEM platform

From that point, the TANDEM developer proved to be an invaluable collaborator, as some of the functions of the final platform would need to be determined by processing elements from Python and Django.

UI/UX CONCERNS

Most UX components were fairly straightforward to identify for the current platform and application, such as giving the user the ability to browse local folders. However, while TANDEM had aimed to provide users with information-rich error messages, it became difficult to locate a script to produce such prompts. Searching Stack Overflow for anything involving “error” in the name retrieved inadequate results, and adding “progress” to the query addressed only half of the need. Future iterations of TANDEM will include elements and messages that help guide the user.

TEST CORPUS

To explore the functionality of TANDEM in its beta stage, the team employed a test corpus: The Real Mother Goose by Blanche Fisher Wright, published circa 1916. The test corpus was identified on HathiTrust as a book in the public domain. After digitizing, the corpus was pushed through the working prototype, and data was generated. Fact-checking the data by hand, the team was astonished at the software’s ability to do what they had set out for it to do.

Image 4: Sample from TANDEM’s test corpus of Mother Goose

Image 5: Sample TANDEM output

The test corpus illustrates TANDEM’s ability to streamline data generation from multimodal print artifacts. Because the corpus is in the public domain, meaning neither an individual nor a corporation holds the copyright, TANDEM was able to make the entire corpus (121 digitized pages) and its data output available on the TANDEM website as a learning tool. As the TANDEM team moves forward with developing the tool, they are also actively exploring the data visualization and analysis potential for the data generated by the tool. Sample visualizations are live on dhtandem.com.

DOCUMENTATION

In order to get around the application’s current lack of error messages, detailed documentation of best practices was developed to assist users with personal troubleshooting. Because TANDEM incorporates open source technologies, it became apparent that those technologies’ documentation would also be necessary for other aspects of the platform, such as Licensing and Terms of Use. This need became even more important as text was developed to describe the test corpus and how copyright comes into effect for users’ files. Thanks to GitHub, the documentation for the open source technologies was fairly straightforward. Developing clear and concise language for best practices and copyright took more creativity, especially when sticking to the clear, friendly language employed on other TANDEM pages.

OUTREACH

From week one, TANDEM’s outreach goals included establishing a sense of the community interested in text and image data: potential users as well as like-minded developers who might help with development and articulation of aims. To get a better sense of the projects happening in New York City, the outreach coordinator attended a number of citywide DH events and made contact with researchers at Columbia and NYU. She also reached out to scholars and librarians at Stanford, Princeton, NYPL Labs, the Dumbarton Oaks Research Library, and the Biodiversity Library.

Early outreach involved contacting domain experts. Lev Manovich gave the team some affirmation that TANDEM hits on something other DHers are not quite doing. Interest from Dr. Bill Gleason at Cotsen Children’s Library at Princeton, where they are working on ABC book digitization, signalled TANDEM’s relevance in the field.

In addition to generating response within the crowd who might be interested in a tool like TANDEM, the outreach coordinator wanted to get word to those who might not consider text and image data. To draw people in, she started using the hashtag #picturebookshare. By connecting the tool to picture books and sharing images, #picturebookshare kept the team considering the work they were aiming to examine, and it helped make contact with people in advance of the working prototype. Retweets from Illustrator Jon Klassen and data visualization guru Edward Tufte demonstrated the potential for text and image integration across many fields. While #picturebookshare continues to chime away, the team also now uses the hashtag to generate research ideas for potential TANDEM users. Fun distant futures for TANDEM might involve the visual trajectories of various aspects of books: visuality of covers or book spines, as well as the visual history of education materials.

By the seventh week, TANDEM had a small but invested following on Twitter and interest from a number of arenas. Outreach then turned to more domain experts. As the team approached code integration, they knew they would want additional support setting up the web framework. The outreach coordinator attended the DjangoGirls NYC workshop weekend and began dropping in on a weekly Django Meetup. Maintaining contact with Geoffrey Sechter, a Django developer, proved valuable for troubleshooting later on. Another valuable contact was Michel Biezunski who, with his company Info Loom, engineered the data management for the I.R.S. and now NYU libraries. The project manager attended meetups at DaniPad NYC Tech Coworking space in Queens, NY, and met Python developers who had insight into working with Django-based web apps. Commercial uses for TANDEM-like applications were brainstormed, and people responded with interest in testing a prototype. The project manager also reached out to programmers in his private sector network, including Mike Brittain at Etsy and Matt Makai of Twilio, for guidance and opinions about Django, UI/UX, and TANDEM data.

As the launch approached, outreach shifted focus to consolidating project buzz and pulling together the various potential users and proven allies to solidify the project and unify the efforts of the class. Seeking potential applications for TANDEM, the outreach coordinator attended several workshops and conferences: Emily Fuhrman’s D3 data visualization workshop at Studio@Butler, Theorizing the Web 2015 at ICP, and The Verge NYC after party at Thoughtworks. Continuing to garner community support, the outreach coordinator attended a GC Digital Initiatives event and the English department’s Friday Forum. Three weeks out, initial personalized invitations for the launch went out to the digital fellows and DH Praxis friends and family via Paperless Post, and the press release on the class wiki started to come together.

Friday, May 8, the project manager and outreach coordinator presented a lightning talk at the GC Digital Initiative’s Media Res #1. Seeing TANDEM beside diverse NYCDH projects was illuminating and reassuring. Not only were useful contacts made in the other presenters, but the format offered a chance to see how the final presentation would be received.

Once the team had tested TANDEM with different file types and corpus sizes, the outreach coordinator contacted several key beta-testers: Patrick Smyth, a CUNY GC Digital Fellow and English PhD candidate; Dr. Marsely Kehoe, an art historian who specializes in Dutch colonial art; and Susan McGregor, a professor of digital journalism at Columbia’s Tow Center. The initial feedback was positive and speaks to the promise of the tool. A handful of changes were identified and made; the server was reset for the final time hours before the official launch.

NEXT STEPS

Moving forward, TANDEM aims to expand its functionality in a number of areas. Continued outreach to recruit beta users to work with TANDEM will be essential to future improvements. User feedback will drive the development priorities of upcoming releases. While TANDEM persists user data now, there is no interface to provide access to it. With a user login interface, users would be able to retrieve and/or modify their projects. This feature will also enable sharing of data across projects. TANDEM output is designed to provide input into data-visualization tools. In order to provide an end-to-end platform for users, a future release will include built-in data visualization tool(s). These built-ins are already being developed in the form of R, HTML/CSS, and Python-based scripts to manipulate and visualize the data that is generated.

Features are also being planned to improve the range of accepted texts by handling a longer list of file types. While TANDEM already supports the most common file types, the team recognizes a need to include a more robust set to open TANDEM’s potential audience even further. Since the quality of the NLTK analysis depends heavily on having accurate text input, a more robust OCR engine is planned. A smarter OCR will provide for recognition of more varied fonts and older fonts, and implementation of Intelligent Character Recognition (ICR) will catapult the tool into the real intersection of text and image. The ability to recognize words both printed and embedded in the art of the illustrated pages will enable richer analysis and greater scope. This integration is one of the highest priorities for future versions.

As TANDEM expands on the back and front ends, it can include more complex NLTK queries to return topic models, word frequency analyses, relational data in the form of bi-grams (two words that occur together frequently), and tri-grams (three words that occur together frequently). More complex OpenCV queries including shape recognition, image saturation, image contrast data, and user-selection based image processing could add valuable depth to the tool. All of these are possible with the technology that TANDEM already has incorporated. TANDEM’s image processing currently processes a page at a time as if each page were a single image. Image data can be skewed if the image occupies only a portion of the page or the page contains multiple disparate images. Future releases will build algorithms to parse the pages to provide more granular image data.
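The bi-gram and tri-gram queries mentioned above can be illustrated with a short sketch. NLTK ships its own n-gram helpers; the plain-Python version below is only meant to show what is being counted, and the function name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of `n` consecutive words in a token list."""
    # Zipping the list against copies of itself shifted by 1..n-1
    # positions yields one tuple per window of n adjacent words.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(windows)


words = "the cat and the cat sat".split()
bigrams = ngram_counts(words, 2)
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

Ranking the resulting counts with `most_common` surfaces the word pairs or triples that occur together most often in a corpus.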

ACKNOWLEDGMENTS

TANDEM could not have been built without the help of CUNY Digital Humanities Praxis professors Amanda Hickman, Luke Waltzer, Matt Gold and Steve Brier; classmates in Digital HUAC, NYC Fashion Index, and CUNYCast; GC Digital Fellows Evan Misshula, Patrick Smyth, Keith Miyake, and Erin Glass; Tim Owens; Zach Davis; Geoff Sechter; Lev Manovich; Studio@Butler; Jeff Binder; Aaron Plasek; Kathy Koutsis; P Lathrop; Marty Epstein; Stephen Zweibel; Marsely Kehoe; Susan McGregor; Julia Pollack; NYGraphX Corp.

May 25, 2015 by Jojo Karlin in Design Plan, development, Outreach Plan, Project Plan

Tagged with: #picturebookshare, @sbreal, abstract, development, dhpraxis14, django, documentation, final paper, minimum viable product, MVP, outreach, project update, TANDEM, UI/UX, web framework, wireframing
