This version lets you download the current state of the corpus in real time through the Basic Search option. It should be cited as:
Fernández-Ordóñez, Inés (dir.): Corpus oral y sonoro del español rural [download date].
This version gives researchers access to the different tagged versions of the corpus. It was developed by F. Javier Pueyo Mena using the FreeLing library package. It should be cited as:
Version 4.0 (March 2024): Pueyo Mena, F. Javier: Corpus oral y sonoro del español rural etiquetado. Versión 4.0 [March 2024].
Version 3.0 (May 2022): Pueyo Mena, F. Javier: Corpus oral y sonoro del español rural etiquetado. Versión 3.0 [May 2022].
Version 2.0 (December 2020): Pueyo Mena, F. Javier: Corpus oral y sonoro del español rural etiquetado. Versión 2.0 [December 2020].
1) XML Tags and their attributes:
<turno> id, mp3
<inf>
<HS> id
<HCRUZ> id
<NP> id
<emisiones> id
<VS> id
<tempo> id
<pron> id
<pausas> id
<lit> id
<intel> id
<gestos> id
<interr> id
<punct> id, lemma, pos
<w> id, lemma, pos
2) In the textual content of the <w> (word) tag, a dialectal form is paired with its standard form by the "=" (equals) symbol, written as dialectal=standard. The "lemma" and "pos" attributes disambiguate identical dialectal forms:
<w id="8178" lemma="cada" pos="DI0CS0">ca=cada</w>
<w id="11255" lemma="casa" pos="NCFS000">ca=casa</w>
<w id="5132" lemma="cal" pos="NCFS000">ca=cal</w>
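The "=" convention can be unpacked with a few lines of Python. The following is a minimal sketch, not part of the corpus tooling: the <sample> wrapper element and the function name split_word_form are hypothetical, and the last <w> element is added only to show a word without a dialectal form.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <sample>; the <w> elements follow the examples above.
SAMPLE = """<sample>
<w id="8178" lemma="cada" pos="DI0CS0">ca=cada</w>
<w id="11255" lemma="casa" pos="NCFS000">ca=casa</w>
<w id="5132" lemma="cal" pos="NCFS000">ca=cal</w>
<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
</sample>"""


def split_word_form(text):
    """Return (dialectal, standard); both are the same when there is no "="."""
    dialectal, sep, standard = text.partition("=")
    return (dialectal, standard) if sep else (text, text)


forms = [split_word_form(w.text) for w in ET.fromstring(SAMPLE).iter("w")]
for dialectal, standard in forms:
    print(dialectal, "->", standard)
```

Splitting on the first "=" only is an assumption; it is enough for the forms shown here.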
3) The textual content of the <punct> tag always carries the "~" symbol. Its position relative to the punctuation mark shows whether the mark is an opening sign ("~" after the mark):
<punct id="9359" lemma="«" pos="Fra">«~</punct>
or a closing sign ("~" before the mark):
<punct id="9363" lemma="»" pos="Frc">~»</punct>
The symbol is sometimes redundant, since the "pos" attribute of some punctuation marks already encodes this information: « (Fra), » (Frc), etc.
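As an illustration, the position of "~" can be read off mechanically. This is a minimal sketch under the convention described in point 3; the function name punct_role is hypothetical.

```python
def punct_role(text):
    """Classify the content of a <punct> element by the position of "~".

    Returns a (role, mark) pair, where role is "opening", "closing" or
    "unknown" and mark is the punctuation sign without the "~".
    """
    if text.endswith("~"):
        return "opening", text[:-1]
    if text.startswith("~"):
        return "closing", text[1:]
    return "unknown", text


for content in ["«~", "~»", "~."]:
    print(content, punct_role(content))
```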
4) The "id" attribute of a <punct> element is the same as that of the word (<w>) that either precedes it:
<w id="364" lemma="yo" pos="PP1CSN00">yo</w> <punct id="364" lemma="." pos="Fp">~.</punct>
or follows it:
<punct id="345" lemma="¿" pos="Fia">¿~</punct> <w id="345" lemma="y" pos="CC">Y</w>
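Because punctuation shares the "id" of its neighbouring word, the two can be rejoined into a surface form. The sketch below is one possible approach, not the corpus's own tooling. It assumes that "id" values are unique among the <w> elements of the fragment being processed; the <sample> wrapper and the function name attach_punctuation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <sample>; the inner elements follow the examples above.
SAMPLE = """<sample>
<punct id="345" lemma="¿" pos="Fia">¿~</punct>
<w id="345" lemma="y" pos="CC">Y</w>
<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
<punct id="364" lemma="." pos="Fp">~.</punct>
</sample>"""


def attach_punctuation(root):
    """Map each word id to its text with the attached punctuation marks."""
    surface = {w.get("id"): w.text for w in root.iter("w")}
    prefix = {wid: "" for wid in surface}
    suffix = {wid: "" for wid in surface}
    for p in root.iter("punct"):
        wid, text = p.get("id"), p.text
        if wid not in surface:
            continue  # no matching word in this fragment
        if text.endswith("~"):    # opening sign: goes before the word
            prefix[wid] += text[:-1]
        elif text.startswith("~"):  # closing sign: goes after the word
            suffix[wid] += text[1:]
    return {wid: prefix[wid] + surface[wid] + suffix[wid] for wid in surface}


attached = attach_punctuation(ET.fromstring(SAMPLE))
print(attached)
```

Collecting opening and closing marks separately keeps several marks on the same word in document order.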
5) Proper names that could reveal the identity of the informants have been anonymized:
<w id="2716" lemma="Anonimizado" pos="NP00000">Anonimizado</w>
All other proper names are retained:
<w id="5" lemma="dulantzi" pos="NP00000">Dulantzi</w>
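A short sketch of how the retained proper names might be separated from the anonymized ones. It assumes, based only on the two examples above, that proper names carry a "pos" value beginning with "NP" and that anonymized names have the lemma "Anonimizado". The <sample> wrapper and the function name retained_proper_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <sample>; the <w> elements follow the examples above.
SAMPLE = """<sample>
<w id="2716" lemma="Anonimizado" pos="NP00000">Anonimizado</w>
<w id="5" lemma="dulantzi" pos="NP00000">Dulantzi</w>
<w id="364" lemma="yo" pos="PP1CSN00">yo</w>
</sample>"""


def retained_proper_names(root):
    """Return the text of proper names that were not anonymized."""
    return [
        w.text
        for w in root.iter("w")
        if w.get("pos", "").startswith("NP") and w.get("lemma") != "Anonimizado"
    ]


names = retained_proper_names(ET.fromstring(SAMPLE))
print(names)
```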
6) The "pos" attribute contains the morphosyntactic analysis of each word using the EAGLES tag set, in the format developed for Spanish in the FreeLing library package. These tags have been slightly modified. For example, terms that FreeLing tags as possessive pronouns (el tuyo) or indefinite pronouns (article + uno, otro, más, poco, mucho) have been recategorized as possessive or quantifying adjectives, respectively. "Indefinite determiners" (un, algún, ningún, otro, mucho, poco, tanto, todo, cada, más, menos) have been grouped as quantifiers.
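For a coarse overview, the first character of an EAGLES-style tag gives the main category. The sketch below uses the category letters of the standard EAGLES tag set for Spanish as an assumption. Since the corpus tags have been modified, the mapping should be checked against the corpus before being relied on. The names EAGLES_CATEGORIES and coarse_category are hypothetical.

```python
from collections import Counter

# Assumed mapping from the first tag character to a main category
# (standard EAGLES letters; the modified corpus tag set may differ).
EAGLES_CATEGORIES = {
    "A": "adjective",
    "C": "conjunction",
    "D": "determiner",
    "F": "punctuation",
    "I": "interjection",
    "N": "noun",
    "P": "pronoun",
    "R": "adverb",
    "S": "adposition",
    "V": "verb",
    "W": "date",
    "Z": "number",
}


def coarse_category(pos):
    """Return the main category for a tag, or "unknown" if not mapped."""
    return EAGLES_CATEGORIES.get(pos[:1], "unknown")


# Tags taken from the examples in this document.
tags = ["DI0CS0", "NCFS000", "PP1CSN00", "CC", "Fra", "NP00000"]
counts = Counter(coarse_category(t) for t in tags)
print(counts)
```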