Linguistic Data Consortium

The leading non-profit consortium that creates, shares and preserves high-quality language resources

05/21/2026

CALLHOME German Lexicon Second Edition was developed by LDC and contains 318,809 German words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18). The words in the lexicon were derived from the CELEX German lexicon (CELEX2 (LDC96L14)) and from 100 transcripts representing unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition LDC2026S04. This release also includes a pronunciation dictionary in CMUdict format, derived from the lexicon. https://catalog.ldc.upenn.edu/LDC2026L04

05/20/2026

CALLHOME German Second Edition was developed by LDC and contains 48 hours of speech from 100 telephone conversations between native German speakers. It is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech LDC97S43 and CALLHOME German Transcripts LDC97T15 with additional transcription and updated directory structure, file formats, and documentation. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.

In addition to the original transcripts published in CALLHOME German Transcripts, this release includes updated transcripts that normalize annotation formats, standardize the marking of speaker-produced and background noises, apply foreign-language marking, clean up whitespace, and incorporate corrections and consistency fixes. https://catalog.ldc.upenn.edu/LDC2026S06

05/19/2026

MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations with content and annotation layers integrated in a single MADCAT XML output, for a total of 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style, writing implement and paper. Each page was scanned and the images were annotated.

The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents. https://catalog.ldc.upenn.edu/LDC2026T05

05/18/2026

LDC’s May newsletter announces three new publications: MADCAT Phases 1-3 Composite Evaluation Set, CALLHOME German Second Edition and CALLHOME German Lexicon Second Edition. http://ldc-upenn.blogspot.com/

04/21/2026

The LORELEI series continues with LORELEI Somali Representative Language Pack, a collection of comprehensive resources for HLT development (monolingual text, Somali-English parallel text, entity annotation, noun phrase chunking and related software tools) developed by LDC for the DARPA LORELEI program. The LORELEI program was concerned with building human language technology for low resource languages in the context of emergent situations. Data was collected from discussion forums, news, reference, social network, and weblog sources. https://catalog.ldc.upenn.edu/LDC2026T03

04/20/2026

MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones from a variety of environments. Transcripts cover 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations. The MATERIAL program focused on underserved languages with the ultimate goal of building cross language information retrieval systems to find speech and text content using English search queries. https://catalog.ldc.upenn.edu/LDC2026S05

04/17/2026

DEFT Chinese and English Light and Rich ERE Parallel Annotation was developed by LDC and contains 179 Chinese discussion forum documents and their English translations annotated for entities, relations, and events (ERE). Light ERE annotation labels entity mentions, and relations and events between and among those entities, for a target set of entity, relation and event types, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. All 179 Chinese-English document pairs were annotated following Light ERE annotation guidelines; a subset of 171 Chinese-English document pairs was also labeled with Rich ERE annotation. The source data and English translations were drawn from BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05), originally collected and translated by LDC under the DARPA BOLT program. https://catalog.ldc.upenn.edu/LDC2026T04

04/16/2026

Check out our April newsletter for LDC’s latest publications: DEFT Chinese and English Light and Rich ERE Parallel Annotation, MATERIAL Tagalog-English Language Pack and LORELEI Somali Representative Language Pack. http://ldc-upenn.blogspot.com/

Address

3600 Market Street, Ste 810
Philadelphia, PA 19104
