NeIC Conference 2013: Report on "Digital Humanities - A New Era?"


One society needs a research infrastructure for all areas.

"Tidying up the Basement: A Tale of Large-Scale Parsing on National eInfrastructure" (Stephan Oepen)

UiO, Institutt for informatikk.
  • There is hard science in the humanities, or at least relevant science.
  • HPC has been of use in Stephan's own research.
  • Natural language processing: Google Translate, Bing. It is impossible today to launch a web search service without translation.
  • Conversational interfaces.
  • What is NLP?
    • IBM Watson system outperformed champions on the American quiz show Jeopardy.
    • Inherently data intensive.
    • Relevant for the "knowledge society".
    • Can advance digital humanities, e.g. in linguistics and philosophy.
  • NLP is a hard problem because of ambiguity and nuance.
  • "Den andra vägen till Bergen är kortare" -> "Other one autobahn against Mountains am abrupt."
  • Syntactic analysis is construction of parse trees. Semantic analysis ranks trees by probabilities.
  • Train AIs on frequencies of occurrence of semantic structures -> rank trees.
  • Translations are trained on pairs of likely translated documents.
  • Symbolic computation: symbols (S, NP, ...) have internal structure, each with ~400-node DAGs. Parse time is 10 s for a 20-word sentence.
  • Statistical modelling: tens of millions of parameters, large training set of 300 GB.
  • Many subproblems are embarrassingly parallel. Some are memory intensive and non-local. Using 1 TB RAM machines for CRF parameter estimation.
  • Manual labor decreases by having better hardware, but demands will always grow to match the supply.
  • Semanto-syntactic analysis of Wikipedia can draw meaning from text. 1.3 million articles, 55 million utterances, ~900 million tokens.
  • Running all this takes 120000 core hours (two days, 14h on Abel) and outputs 200 GB of compressed data. Use in example discovery (US), distributional semantics, ...
  • HPC computing in the research group took a lot of time and resources: 10000 sentences / month. Now, using Abel, we run Wikipedia every few days.
  • Currently building a portal (science gateway).
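
The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not something presented in the talk. It assumes that "two days, 14h" means 62 wall-clock hours and that the core hours were spread evenly over that period; the function names are made up for illustration.

```python
# Back-of-envelope check of the numbers quoted in the talk notes.
# Assumption (not stated in the notes): "two days, 14h" is 62 hours
# of wall-clock time on Abel, with the load spread evenly.

CORE_HOURS = 120_000              # total compute quoted in the notes
WALL_CLOCK_HOURS = 2 * 24 + 14    # assumed reading of "two days, 14h"
UTTERANCES = 55_000_000           # Wikipedia utterances parsed
TOKENS = 900_000_000              # approximate token count
OLD_SENTENCES_PER_MONTH = 10_000  # throughput before Abel


def average_cores(core_hours, wall_hours):
    """Average number of cores busy at the same time."""
    return core_hours / wall_hours


def core_seconds_per_utterance(core_hours, utterances):
    """Average compute cost of one utterance, in core seconds."""
    return core_hours * 3600 / utterances


def months_at_old_rate(utterances, per_month):
    """How long the same job would take at the pre-Abel throughput."""
    return utterances / per_month


if __name__ == "__main__":
    print(f"average cores in use: {average_cores(CORE_HOURS, WALL_CLOCK_HOURS):.0f}")
    print(f"core seconds per utterance: {core_seconds_per_utterance(CORE_HOURS, UTTERANCES):.1f}")
    print(f"tokens per utterance: {TOKENS / UTTERANCES:.1f}")
    print(f"months at old rate: {months_at_old_rate(UTTERANCES, OLD_SENTENCES_PER_MONTH):.0f}")
```

Under these assumptions the run keeps roughly 1900 cores busy, and the average cost of about 8 core seconds per utterance of about 16 tokens is in line with the quoted parse time of 10 s for a 20-word sentence. At the old rate of 10000 sentences per month, the same job would take 5500 months.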

Take-home message

  • Language technologies are inherently computational and data-driven.
  • Profile of the group has been shaped by HPC access.
  • Great not to have compute in the basement. Federate and amass in large installations.
  • Received advanced user support and special technical solutions from NoTur.
  • More and more areas go computational.
  • Portals as a new model for access and allocation.
QA
  • Have you considered MapReduce? No. Not available. Immediate gains not readily apparent.
  • How should IP engage with humanities? Norway has a policy of broadening the range of disciplines. Also had cluster experience. Mutual respect is key, and can be increased by showing interest in exploring the possibilities.

"Nordic Contributions to Developing a European Digital Services Infrastructure for Social Sciences and Humanities" (Hans Jørgen Marker)

Director, Swedish National Data Service
  • Keep research data available and accessible for future research in humanities, medicine and social sciences.
  • CESSDA for European coordination of such data archives. On the way to becoming an ERIC on the 18th of June.
  • DDI 1.0 describes the whole process from start to end of a research project.
  • CESSDA ERIC should have been hosted in Norway (currently impossible); until this becomes possible, it is run as a Norwegian AS.
  • CESSDA data infrastructure?
    • Integrated research discovery tools
    • SSO
    • Extensible
    • Certification/auditing
    • Professionalization
    • Standards development
  • ESFRI Cluster projects - BioMedBridges is one, CESSDA part of DASISH (18 ESFRI partners). Others are ENVRI and CRISP.
  • Consortium: CESSDA, ESS, CLARIN, DARIAH overlapping organisation, SHARE on the side.
  • DataCite: acknowledge that data are valid scientific contributions.

Fitting it all together:

  • Too many SSH (social sciences and humanities) projects are addressing the same issues in the same way.
  • EC receives each new application from scratch.
  • All the projects are invited to Gothenburg in October.
QA
  • How can political engagement lead to standardisation, e.g. DOI vs handles? Sometimes standards can complement each other, but these do not. The solution lies in organisation, to level out the standards. The Nordics are not in opposition to Europe and beyond, but rather a different level of the same thing. Nordic coordination is often informal, which we are very good at.

"Nordic Opportunities for Digital Humanities" (Erik Champion)

Project lead for DIGHUMLAB
  • "What do you do with a million books?" Stephen Ramsay 2010
  • Paper machines (JSTOR, Zotero)
  • DH commons - speed dating for DH
  • arts-humanities.net - something like PLOS ONE.
  • Humanities 3.0 - primer for new DH people.
  • DH infrastructure should have feedback channels to developers.
  • Avoid bureaucracy - (horse barn example from US with table and chairs).
  • DIGHUMLAB has and promotes a wide range of tools, gateways and services.
  • Launched in 2012.
  • ERIC status: CLARIN approved, DARIAH hopefully approved by mid 2013.

Why DIGHUMLAB

  • develop tools
  • develop Danish and EU policy
  • develop communities

DH ecosystem

  • DH commons, NeDiMAH, DiRT bamboo, Open Edition, Open Library of Humanities, DARIAH
  • HACK4LT - Get programmers and humanities scholars in the same room for 24h. "This shows what you can do with students, pizza and prizes."

Take away point

  • Digital humanities can help scholars create and share.
  • Audience is instant and everywhere.
  • DIGHUMLAB is part of 2 ERICs.

Panel Discussion

How do we do infrastructure? Where do we focus? Technology, projects? Where is the challenge?

Stephan:

  • HPC for the masses. The entry level is too high. Get rid of the Linux command prompt, improve relations.

Hans:

  • SNIC provides big muscles. Nitty gritty does not go in there. SNIC is fine though. You can always develop, but fine for now. Do more of the same.

Erik:

  • Better explanations of what exists and can be done. Consolidating and inventorizing.
  • IT departments are bottlenecks rather than facilitators.
  • NRENs need to offer application support (=advanced user support) and troubleshooting for new users.

Erik:

  • Give staff one day a week to do whatever.

Stephan:

  • Application support (=advanced user support) is needed.

Session Summary

Lessons Learned

  • Natural language processing has become internationally competitive thanks to high-performance computing, for example in the semanto-syntactic analysis of Wikipedia.
  • There is an expressed need in digital humanities for user interfaces to high-performance computing other than a Linux command prompt.

Future Directions

  • Too many projects in social sciences and humanities are addressing the same issues in the same way. This can be alleviated through consolidating and networking.
  • Application support or advanced user support would be very helpful in the digital humanities.

Opportunities

  • Creating a Nordic dataset/service index in the style of data.gov would be helpful.