Linguistic Data Consortium

The leading non-profit consortium that creates, shares and preserves high quality language resources

06/18/2026

Multi-Language Conversational Telephone Speech 2014 – Spanish & Portuguese comprises 123 hours of Spanish and Portuguese telephone speech in 569 recordings covering Brazilian Portuguese, Caribbean Spanish, European Spanish and Latin American Spanish, collected by LDC to support research and technology evaluation in automatic language identification. Portions of these recordings were used in the NIST 2015 and 2017 language recognition evaluations. The collection focused on language pair discrimination for 20 languages/dialects, some of which could be considered mutually intelligible or closely related. https://catalog.ldc.upenn.edu/LDC2026S07

06/17/2026

KAIROS Phase 1 Evaluation Source Data, Annotation, and Assessment was developed by LDC and contains the English and Spanish source data (text, video, images), manual annotations, reference knowledge graphs, the system output assessed during the evaluation, and human assessment results from the Phase 1 evaluation of the DARPA KAIROS Program. The Phase 1 evaluation focused on the improvised explosive bombing scenario with nine complex events and two surprise complex events in the mass shooting scenario.

Manual annotation and assessment of event-relevant documents for 10 complex events are included in this release. Scenario-relevant events and relations were labeled for each document to develop a structured representation of temporally-ordered events, relations and arguments that expressed the scenario-relevant events in each complex event. A reference knowledge graph (Graph G) was developed for each event; systems were expected to match the Graph G with a given schema library. Assessment data includes human assessment judgments and the system output that was manually assessed for the end-to-end evaluation task.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. https://catalog.ldc.upenn.edu/LDC2026T07

06/16/2026

LDC’s June newsletter contains important announcements for LDC organization account administrators and data licensees, as well as details on three new publications: http://ldc-upenn.blogspot.com/

05/21/2026

CALLHOME German Lexicon Second Edition was developed by LDC and contains 318,809 German words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18). The words in the lexicon were derived from the CELEX German lexicon (CELEX2 (LDC96L14)) and from 100 transcripts representing unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition LDC2026S04. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format. https://catalog.ldc.upenn.edu/LDC2026L04

05/20/2026

CALLHOME German Second Edition was developed by LDC and contains 48 hours of speech from 100 telephone conversations between native German speakers. It is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech LDC97S43 and CALLHOME German Transcripts LDC97T15 with additional transcription and updated directory structure, file formats, and documentation. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.

In addition to the original transcripts published in CALLHOME German Transcripts, this release has updated transcripts addressing normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes. https://catalog.ldc.upenn.edu/LDC2026S06

05/19/2026

MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations, with content and annotation layers integrated in a single MADCAT XML output, for a total of 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style, writing implement and paper. Each page was scanned and the images annotated.

The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents. https://catalog.ldc.upenn.edu/LDC2026T05

05/18/2026

LDC’s May newsletter announces three new publications: MADCAT Phases 1-3 Composite Evaluation Set, CALLHOME German Second Edition and CALLHOME German Lexicon Second Edition. http://ldc-upenn.blogspot.com/

04/21/2026

The LORELEI series continues with LORELEI Somali Representative Language Pack, a collection of comprehensive resources for HLT development (monolingual text, Somali-English parallel text, entity annotation, noun phrase chunking and related software tools) developed by LDC for the DARPA LORELEI program. The LORELEI program was concerned with building human language technology for low resource languages in the context of emergent situations. Data was collected from discussion forums, news, reference, social network, and weblog sources. https://catalog.ldc.upenn.edu/LDC2026T03

04/20/2026

MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones from a variety of environments. Transcripts cover 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations. The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries. https://catalog.ldc.upenn.edu/LDC2026S05

Address

3600 Market Street, Ste 810
Philadelphia, PA
19104

Telephone

+12158980464

Website

http://www.ldc.upenn.edu/
