The corpus can be downloaded in two formats: TXT (the transcriptions as accessible through the Basic Search) and XML (the transcriptions with the morphosyntactic tags available through the Advanced Search).

Corpus Download in TXT

This option downloads, in real time, an up-to-date version of the corpus available through the Basic Search. It should be cited as:


Fernández-Ordóñez, Inés (dir.): Corpus oral y sonoro del español rural [download date].

Corpus Download in XML

This option gives researchers access to the successive tagged versions of the corpus, developed by F. Javier Pueyo Mena using the FreeLing library package. Each version should be cited as:


Version 3.0 (May 2022): Pueyo Mena, F. Javier: Corpus oral y sonoro del español rural etiquetado. Versión 3.0 [May 2022].

Version 2.0 (December 2020): Pueyo Mena, F. Javier: Corpus oral y sonoro del español rural etiquetado. Versión 2.0 [December 2020].



1) XML Tags and their attributes:


<turno>		id, mp3
<HS>		id
<HCRUZ>	id
<NP>		id
<emisiones>	id
<VS>		id
<tempo>	id
<pron>		id
<pausas>	id
<lit>		id
<intel>		id
<gestos>	id
<interr>		id
<punct>		id, lemma, pos
<w>		id, lemma, pos
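
As an illustration of how the token-level attributes listed above might be read, here is a minimal Python sketch using the standard library. The <sample> wrapper, the Token type and the read_tokens function are hypothetical names introduced for this example; the actual nesting of elements in the downloaded files is not described here, so the sketch simply searches at any depth.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Token(NamedTuple):
    tag: str              # "w" or "punct"
    id: Optional[str]
    lemma: Optional[str]
    pos: Optional[str]
    text: str


# Hypothetical fragment: the <sample> wrapper is NOT part of the corpus format.
SAMPLE = """<sample>
<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
<punct id="364" lemma="." pos="Fp">~.</punct>
</sample>"""


def read_tokens(root):
    """Collect every <w> and <punct> element, in document order."""
    tokens = []
    for el in root.iter():
        if el.tag in ("w", "punct"):
            tokens.append(
                Token(el.tag, el.get("id"), el.get("lemma"),
                      el.get("pos"), (el.text or "").strip())
            )
    return tokens


tokens = read_tokens(ET.fromstring(SAMPLE))
```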


2) In the textual content of the <w> (word) tag, a dialectal form is disambiguated by pairing it with its standard form, joined by the "=" (equals) sign in the order dialectal=standard:


		<w id="8178" lemma="cada" pos="DI0CS0">ca=cada</w>

		<w id="11255" lemma="casa" pos="NCFS000">ca=casa</w>

		<w id="5132" lemma="cal" pos="NCFS000">ca=cal</w>
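
The "=" convention can be handled with a simple split. The following Python sketch is only illustrative: the function name split_form is hypothetical, and treating a word without "=" as its own standard form is an assumption rather than something stated above.

```python
def split_form(text):
    """Split the content of a <w> element into (transcribed, standard).

    Assumption: when there is no "=", the transcribed form is already
    the standard form, so it is returned twice.
    """
    transcribed, sep, standard = text.partition("=")
    if not sep:
        return text, text
    return transcribed, standard


print(split_form("ca=cada"))  # ('ca', 'cada')
print(split_form("ca=casa"))  # ('ca', 'casa')
```

Note that the same dialectal form (ca) maps to different standard forms, so the lemma should be taken from the standard side or from the "lemma" attribute, not from the transcribed side.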


3) The textual content of the <punct> tag always includes the "~" symbol, placed after the mark if it is an opening sign and before the mark if it is a closing sign. The sign is thus either an opening sign:


		<punct id="9359" lemma="«" pos="Fra">«~</punct>


or a closing sign:


		<punct id="9363" lemma="»" pos="Frc">~»</punct>


This symbol is sometimes redundant, since the "pos" attribute of some punctuation marks already encodes the same distinction: « (Fra), » (Frc), etc.
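
A short Python sketch of how the "~" marker might be interpreted when reading <punct> content. The function name parse_punct and the role labels "open" and "close" are hypothetical, chosen only for this example.

```python
def parse_punct(text):
    """Return (mark, role) for the content of a <punct> element.

    role is "open" when "~" follows the mark, "close" when "~"
    precedes it, and None if no "~" is found (not expected in the
    corpus, handled here only defensively).
    """
    if text.endswith("~"):
        return text[:-1], "open"
    if text.startswith("~"):
        return text[1:], "close"
    return text, None


print(parse_punct("«~"))  # ('«', 'open')
print(parse_punct("~»"))  # ('»', 'close')
```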


4) The "id" attribute of a <punct> element matches that of the adjacent word (<w>), which either precedes it:


		<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
		<punct id="364" lemma="." pos="Fp">~.</punct>


or follows:


		<punct id="345" lemma="¿" pos="Fia">¿~</punct>
		<w id="345" lemma="y" pos="CC">Y</w>
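
Because a <punct> element shares its "id" with its host word, the two can be linked with a dictionary lookup. The Python sketch below is illustrative only: the <sample> wrapper and the function name punct_hosts are hypothetical, the fragment juxtaposes the two examples above rather than reproducing a real passage, and the sketch assumes that "id" values are unique among the <w> elements of the parsed file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the two examples above;
# the <sample> wrapper is NOT part of the corpus format.
SAMPLE = """<sample>
<punct id="345" lemma="¿" pos="Fia">¿~</punct>
<w id="345" lemma="y" pos="CC">Y</w>
<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
<punct id="364" lemma="." pos="Fp">~.</punct>
</sample>"""


def punct_hosts(root):
    """Pair each punctuation lemma with the text of the word sharing its id."""
    # Assumption: ids are unique among <w> elements in this tree.
    words = {w.get("id"): w for w in root.iter("w")}
    pairs = []
    for p in root.iter("punct"):
        host = words.get(p.get("id"))
        pairs.append((p.get("lemma"), host.text if host is not None else None))
    return pairs


hosts = punct_hosts(ET.fromstring(SAMPLE))
print(hosts)  # [('¿', 'Y'), ('.', 'yo')]
```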


5) Proper names that could reveal the identity of the informants have been anonymized:


		<w id="2716" lemma="Anonimizado" pos="NP00000">Anonimizado</w>


All other proper names have been kept:


		<w id="5" lemma="dulantzi" pos="NP00000">Dulantzi</w>
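
When extracting proper names, the anonymized placeholders can be filtered out by their lemma. This Python sketch is an illustration under stated assumptions: the <sample> wrapper and the function name proper_names are hypothetical, and the test on the "NP" prefix relies on the two examples above rather than on a full description of the tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the two examples above;
# the <sample> wrapper is NOT part of the corpus format.
SAMPLE = """<sample>
<w id="2716" lemma="Anonimizado" pos="NP00000">Anonimizado</w>
<w id="5" lemma="dulantzi" pos="NP00000">Dulantzi</w>
</sample>"""


def proper_names(root):
    """Return the text of proper-name words, skipping anonymized ones."""
    return [
        w.text
        for w in root.iter("w")
        if w.get("pos", "").startswith("NP")
        and w.get("lemma") != "Anonimizado"
    ]


names = proper_names(ET.fromstring(SAMPLE))
print(names)  # ['Dulantzi']
```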


6) The "pos" attribute contains the morphosyntactic analysis of each word using the EAGLES tag set, in the format developed for Spanish in the FreeLing library package. These tags have been slightly modified. For example, forms that FreeLing tags as possessive "pronouns" (el tuyo) or indefinite "pronouns" (article + uno, otro, más, poco, mucho) have been recategorized as possessive or quantifying adjectives, respectively. The "indefinite determiners" have been grouped as quantifiers (un, algún, ningún, otro, mucho, poco, tanto, todo, cada, más, menos).
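
In EAGLES-style tags the first character encodes the main category, so a coarse category can be read off the "pos" value. The Python sketch below is a rough illustration: the mapping reflects the standard FreeLing tag set for Spanish as generally documented, not the modified tag set of this corpus, and the names EAGLES_CATEGORY and category are hypothetical. Words recategorized in the corpus will carry the first letter of their new category.

```python
# Assumed first-letter mapping of the standard EAGLES/FreeLing tag set
# for Spanish; the corpus's own modifications are not reflected here.
EAGLES_CATEGORY = {
    "A": "adjective",
    "C": "conjunction",
    "D": "determiner",
    "F": "punctuation",
    "I": "interjection",
    "N": "noun",
    "P": "pronoun",
    "R": "adverb",
    "S": "adposition",
    "V": "verb",
    "W": "date",
    "Z": "number",
}


def category(pos):
    """Return the coarse category for a pos tag, or "unknown"."""
    return EAGLES_CATEGORY.get(pos[:1], "unknown")


for tag in ("NCFS000", "DI0CS0", "PP1CSN00", "CC", "Fp"):
    print(tag, category(tag))
```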