Department of Linguistics & Language Development

Language Data

This page lists the speech and text corpora that LLD has obtained from the Linguistic Data Consortium (LDC), of which LLD is a member.

For information on corpora from sources other than the LDC, see David Lee's list and Chris Manning's list.

Some of the corpora listed below can be downloaded directly from our private server (registered users only). Other corpora are marked "CD" or "DVD", which means that the data is held on disc in the LLD office.

Catalog ID  Description                                                Format
LDC95T7     Penn Treebank, Release 2                                   download
LDC2002L49  Buckwalter Arabic Morphological Analyzer Version 1.0       download
LDC2005S25  Santa Barbara Corpus of Spoken American English            1 DVD
LDC2005S26  CSLU: 22 Languages Corpus                                  2 DVD
LDC2005T01  Chinese Treebank 5.0                                       download
LDC2005T06  Chinese News Translation Text Part 1                       download
LDC2005T10  Chinese English News Magazine Parallel Text                1 CD
LDC2005T12  English Gigaword Second Edition                            2 DVD
LDC2005T13  CCGbank                                                    download
LDC2005T14  Chinese Gigaword Second Edition                            1 DVD
LDC2005T23  Chinese Proposition Bank 1.0                               download
LDC2005T28  HARD 2004 Text                                             1 DVD
LDC2005T33  BBN Pronoun Coreference and Entity Type Corpus             online
LDC2005T35  ANC Second Release                                         2 DVD
LDC2006S34  Russian through Switched Telephone Network (RuSTeN)        1 DVD
LDC2006T04  Multiple Translation Chinese (MTC) Part 4                  download
LDC2006T13  Web 1T 5-gram Version 1                                    6 DVD
LDC2006T17  French Gigaword First Edition                              1 DVD
LDC2007T02  English Chinese Translation Treebank v 1.0                 download
LDC2007T09  ISI Chinese-English Automatically Extracted Parallel Text  download
LDC2007T40  Arabic Gigaword Third Edition                              1 DVD