iSGTW - International Science Grid This Week

iSGTW 22 August 2007

 

Feature - Many millions of manuscripts: data mining and digitized objects


As this Computation Institute conference room wall suggests, pen and “paper” is still a popular way to record information. But, as is also suggested, digitization can offer much in the way of improving the accessibility and utility of written materials.
Image courtesy of the Computation Institute.

Just three years old, the Computation Institute’s Teraport has already consumed 2.5 million hours of computing time on more than 800,000 jobs.

James Evans, Assistant Professor in Sociology at the University of Chicago, is a Teraport regular, routinely occupying up to 30 processors at a time for his work on citation network analysis.

Multiplying results

Crunching through hundreds of CPU hours, Evans identifies patterns of interaction between universities and the biotechnology industry, using Teraport to compare the citations of every article with those of every other article in his database—more than 25 million citations.
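As a rough illustration of what comparing every article's citations with those of every other article involves, the sketch below scores each pair of articles by the overlap of their reference lists (Jaccard similarity, a common measure of bibliographic coupling). The article does not say which measure Evans uses; the function names, the toy data, and the choice of Jaccard similarity are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of cited works shared by two reference sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_citation_overlap(references):
    """Compare each article's reference set with every other article's.

    `references` maps an article id to the set of works it cites.
    Returns {(id_a, id_b): score} for every pair with a nonzero overlap.
    """
    scores = {}
    # combinations() visits each unordered pair exactly once.
    for id_a, id_b in combinations(sorted(references), 2):
        score = jaccard(references[id_a], references[id_b])
        if score > 0:
            scores[(id_a, id_b)] = score
    return scores


# Hypothetical toy data, not drawn from Evans' database.
toy = {
    "article_1": {"w1", "w2", "w3"},
    "article_2": {"w2", "w3", "w4"},
    "article_3": {"w9"},
}
overlap = pairwise_citation_overlap(toy)
```

The number of pairs grows as n(n-1)/2, so the workload rises quickly with the size of the database. Each pair can be scored independently of the others, which is the kind of work that divides naturally across many processors.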

In work that requires even more computing power, Evans also analyzes the relationships between authors and organizations producing these documents, and the words within them, to identify the scientific subfields they address.

Distributed computing was possible at his doctoral institution, Evans said, “but the computers on the network were of different sizes, had slightly different software and sometimes different operating systems.”

Teraport offers a uniform operating system and software that, combined with other features, has in some cases saved Evans months of computing time, he said.

 
The ARTFL Project teamed up with Alexander Street Press and Teraport’s computing power to create a database including playbills and posters of more than 1,200 works written by black playwrights between 1850 and 2000. This poster promotes the premiere of Ed Bullins’ Duplex, performed in Harlem in 1970.
Image courtesy of ARTFL and the Hatch-Billops Collection.

Ten million digital books

Analyzing the growing volume of digitized books and texts poses massive challenges and opportunities. The University of Chicago has joined a consortium of twelve universities working to digitize up to ten million books as part of the Google Book Search Project.

“In digital humanities we will be facing massive amounts of textual material in the next three or four years,” said Mark Olsen, Assistant Director of the Project for American and French Research on the Treasury of the French Language (ARTFL).

“There are a number of teams, including the ARTFL Project, which are ramping up to adopt machine-learning technologies on how to handle a million books.”

Olsen said that although the amount of computer power required by ARTFL’s projects is probably tiny compared to projects from the sciences, it is nevertheless critical to have access to this power.

“Even small tests on our highest-power machines would take 15 or 20 hours to run. These kinds of runs are much faster on the Teraport,” Olsen said. “It extends our capabilities quite a bit.”

The Teraport cluster is part of the Open Science Grid and is a project of the Computation Institute, a joint entity of the University of Chicago and Argonne National Laboratory.

- Steve Koppes, University of Chicago

 
