Final report

1. Introduction

The purpose of this project is to do a cross-dialectal analysis of two competing Spanish diminutive suffixes (-ito, -illo) in terms of productivity; i.e., the extent to which a morphological pattern can be applied to new bases and form new words. In this analysis I also consider dialectal variation, given that the data might not necessarily indicate the same trends for all varieties of Spanish. The main goals are the following:

  1. Explore the cross-dialectal distribution of two competing diminutive suffixes in a representative, cross-dialectal corpus.
  2. Apply statistical measures of productivity (Baayen, 2009) to the data.
  3. Determine whether differences are reflected across varieties.

2. Background

The function of diminutive suffixes is to form a complex word denoting a smaller version of the base (Haspelmath & Sims, 2010). In Spanish, diminutive formation is a robust process attested in all varieties, and it applies widely to nouns and adjectives, as well as a limited set of adverbs and gerunds.

Statistical measures of productivity (Baayen, 2009), explained below, are the second anchor of this project.

  1. Realized productivity: the size of the morphological category, as measured by the type count of the members of the category in a corpus with N tokens.
  2. Expanding productivity: the rate at which a category is attracting new members, as measured by the number of words in the category that occur only once in a corpus of N tokens; the hapax legomena.
  3. Potential productivity: also known as the category-conditioned degree of productivity or Baayen's P. A category that can produce a high number of occasionalisms is thought to be productive, and P measures the likelihood that a word randomly drawn from a corpus and exhibiting a given morphological pattern is an occasionalism. This measure is computed by dividing the number of hapax legomena exhibiting a pattern by the total number of tokens exhibiting the same pattern in a corpus. A drawback of this measure is that it is sensitive to corpus size, which is why an alternative, the hapax-conditioned degree of productivity or P*, is also used. The latter is computed by dividing the number of hapax legomena exhibiting a pattern by the total number of hapax legomena (exhibiting any pattern) in the corpus. For both measures, the higher the resulting number, the higher the productivity. When presenting results, it is standard for both P and P* to be rounded to three decimal places.
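The two potential-productivity measures described above can be written compactly. The notation is an assumption made here for illustration and is not taken from the report: n_1(C) is the number of hapax legomena exhibiting pattern C, N_C is the number of tokens exhibiting pattern C, and h is the total number of hapax legomena in the corpus.

```latex
P = \frac{n_1(C)}{N_C}
\qquad\qquad
P^{*} = \frac{n_1(C)}{h}
```

Both are proportions, so each value falls between 0 and 1.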

3. Research questions and hypotheses

The (refined) research questions that guided my exploration and analysis were the following:

  1. What is the cross-dialectal distribution of each suffix?
  2. Are there differences in statistical measures of productivity?
  3. Are differences reflected across varieties?

For the first question, I hypothesized that -ito would be more frequent across all metrics following prior descriptive work. With regard to the second question, I hypothesized that differences would be reflected in statistical measures of productivity. Lastly, for question three I hypothesized that there would be differences when examining the results by variety given that -illo is claimed to be more productive in Peninsular Spanish.

4. History and process

My data come from the Corpus del español, which is a cross-dialectal corpus created in 2016 and totaling 2 billion words. Twenty-one Spanish-speaking countries are represented in the corpus. The corpus is searchable online and the full data set, which is fully lemmatized and POS-tagged, is available with a license. The data set is available in three formats: (i) Database (Structured Query Language), (ii) Word/lemma/PoS, and (iii) linear (raw) text. All are .txt files and the first two are tab-delimited. I used the second format due to its compatibility with the Pandas data frame structure. The three main challenges I faced when working with this data set were its size (around 70 GB when uncompressed), the structure of the directories, and the extraction of relevant rows for my analysis.

In the first notebook I took care of corpus processing. For this purpose, I created a set of functions that would loop through each directory (one for each country) and create data frame objects that contained only rows ending in the segments of interest. I also created a list of hapax legomena found in each subset of the corpus since it was needed to compute statistical measures of productivity. I kept the code as streamlined as possible but did use one cell for each variety given the memory limitations of my computer. If I were to submit the script as a .slurm job I would condense the script further with another definite loop.
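The directory loop described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it uses only the standard library and streams each file line by line to keep memory use low, whereas the notebooks built Pandas data frames. The column layout in `COLUMNS`, the regular expression, and the function names are all assumptions that would need to be adjusted to the real files.

```python
import csv
import re
from pathlib import Path
from typing import Iterator

# Hypothetical column layout for the word/lemma/PoS files; adjust to the real one.
COLUMNS = ("text_id", "token_id", "word", "lemma", "pos")
# Forms ending in -ito/-ita/-illo/-illa, with an optional plural -s (assumed pattern).
DIMINUTIVE_ENDING = re.compile(r"(?:it|ill)[oa]s?$", re.IGNORECASE)


def iter_matching_rows(path: Path) -> Iterator[dict]:
    """Stream one tab-delimited file and yield rows whose word form ends in a segment of interest."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) != len(COLUMNS):
                continue  # skip malformed lines rather than guessing their layout
            row = dict(zip(COLUMNS, fields))
            if DIMINUTIVE_ENDING.search(row["word"]):
                yield row


def collect_variety(directory: Path) -> list[dict]:
    """Gather matching rows from every .txt file in one country directory."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        rows.extend(iter_matching_rows(path))
    return rows
```

Calling `collect_variety` once per country directory mirrors the one-cell-per-variety approach; wrapping those calls in a loop over directories would be the further condensation mentioned above.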

In the second notebook I focused on cleaning efforts and exploratory analysis. I created a master_DF object including all varieties and refined the POS column to remove syntactic categories to which the morphological pattern does not apply. The problem, however, was that there were still many tokens that did not belong in the data frame because they are (i) lexicalized forms that have acquired a meaning of their own, or (ii) words that meet the word class and phonological requirement but simply do not happen to be diminutives. Based on my knowledge of the topic, the forms to be removed are much more frequent than diminutivized forms. It follows that if a list of highly frequent forms ending in the segments of interest is extracted from the corpus, it can be cross-compared with my data frame's Lemma column. Conveniently, the corpus offers a useful resource for this purpose: a lexicon. I loaded it, derived a frequency-based list of lemmas to exclude, and then removed those lemmas from the master_DF object. This step removed the vast majority of extraneous rows (there are still a few left in the data frame that I would have to remove manually, but given their low frequency and the size of the data set I do not think they should affect the analysis).
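The frequency-based exclusion step can be illustrated with a short sketch. It assumes the lexicon has already been reduced to a mapping from lemma to corpus frequency, and it uses a hypothetical cutoff (`min_frequency`); the report does not state the threshold actually used, and the function names are invented for this example.

```python
import re

# Same assumed ending pattern as for the corpus rows: -ito/-ita/-illo/-illa, optional -s.
DIMINUTIVE_ENDING = re.compile(r"(?:it|ill)[oa]s?$", re.IGNORECASE)


def build_exclusion_set(lexicon: dict[str, int], min_frequency: int) -> set[str]:
    """Return lemmas that end in a segment of interest and are frequent enough
    to be treated as lexicalized or non-diminutive forms (cutoff is hypothetical)."""
    return {
        lemma
        for lemma, frequency in lexicon.items()
        if frequency >= min_frequency and DIMINUTIVE_ENDING.search(lemma)
    }


def drop_excluded(rows: list[dict], excluded: set[str]) -> list[dict]:
    """Keep only rows whose lemma is not in the exclusion set."""
    return [row for row in rows if row["lemma"] not in excluded]
```

For example, with a cutoff of 10,000 a lexicon entry such as `"bolsillo": 30000` would be excluded while `"casita": 120` would be kept. Any cutoff of this kind trades recall for precision, which is why a few extraneous rows can survive.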

In the last notebook, devoted to statistics and analysis, I extracted hapax legomena from the master data frame and created new summary data frame objects including token counts, type counts, hapax legomena counts, P, and P* for the whole data set and for each variety. Finally, I plotted differences across and within varieties. For the final submission I also removed extraneous cells and fixed the code in cells that were returning warnings.
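A summary of this kind can be computed from token-level data in a few lines. The sketch below is an illustration under stated assumptions, not the notebook's implementation: each token is a `(suffix, form)` pair, a hapax is a form that occurs exactly once within its suffix category, and `total_corpus_hapax` (the denominator of P*) is supplied by the caller.

```python
from collections import Counter


def productivity_summary(tokens: list[tuple[str, str]], total_corpus_hapax: int) -> dict[str, dict]:
    """Compute token, type, and hapax counts plus P and P* for each suffix."""
    forms_by_suffix: dict[str, Counter] = {}
    for suffix, form in tokens:
        forms_by_suffix.setdefault(suffix, Counter())[form] += 1

    summary = {}
    for suffix, counts in forms_by_suffix.items():
        n_tokens = sum(counts.values())                       # all occurrences
        n_types = len(counts)                                 # distinct forms
        n_hapax = sum(1 for c in counts.values() if c == 1)   # forms seen once
        summary[suffix] = {
            "tokens": n_tokens,
            "types": n_types,
            "hapax": n_hapax,
            "P": n_hapax / n_tokens,
            "P_star": n_hapax / total_corpus_hapax,
        }
    return summary
```

Running the same function on the tokens of a single variety, with that variety's hapax total, would give the by-variety figures.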

5. Exploration

The following are visualizations of the distribution of the data. You can find tables providing specific figures for each diminutive by POS and variety in the cleaning and exploratory analysis notebook.

Figure 1 (N = 1,429,012 tokens)

Figure 1 above illustrates the token counts for each diminutive across the entire data set. As expected, -ito is far more numerous. Its 1,195,810 tokens account for 84% of the total, as compared to 233,202 tokens (16%) for -illo. Token counts, however, are only useful for a preliminary estimation of differences in size given that productivity can vary diachronically and a category with a high token count might be mostly unproductive in the present day (e.g., -ment in English). It is, nevertheless, a good point of departure.

Figure 2

Figure 2 above illustrates the ratio of each diminutive suffix by POS. I use ratios rather than counts for this plot and the following one given the imbalance between categories. The gap between the two suffixes depends on the token's category: -illo is less common with adjectives than with nouns. For nouns, -illo accounts for 18% of the tokens, whereas for adjectives it accounts for only 8%.
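The ratios used in these plots reduce to a per-group share of tokens. A minimal sketch, assuming each row is a dictionary with a `"suffix"` key and a grouping key such as `"pos"` or `"variety"` (these key names are hypothetical):

```python
from collections import Counter, defaultdict


def suffix_shares(rows: list[dict], group_key: str) -> dict[str, dict[str, float]]:
    """For each group, return the share of that group's tokens taken by each suffix."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row["suffix"]] += 1

    shares = {}
    for group, by_suffix in counts.items():
        total = sum(by_suffix.values())
        shares[group] = {suffix: n / total for suffix, n in by_suffix.items()}
    return shares
```

Because every share is normalized within its own group, large groups (nouns, or the biggest national subcorpora) do not visually swamp small ones.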

Lastly and most importantly, Figure 3 shows the ratio of diminutive suffixes by variety. Below is a reminder of the varieties considered, in the order in which they appear in the plot:

  • Argentina: AR
  • Bolivia: BO
  • Chile: CL
  • Colombia: CO
  • Costa Rica: CR
  • Cuba: CU
  • Dominican Republic: DO
  • Ecuador: EC
  • Spain: ES
  • Guatemala: GT
  • Honduras: HN
  • Mexico: MX
  • Nicaragua: NI
  • Panama: PA
  • Peru: PE
  • Puerto Rico: PR
  • Paraguay: PY
  • El Salvador: SV
  • United States: US
  • Uruguay: UY

Figure 3

As can be seen in the chart, the ratio is similar for most countries other than Spain and (interestingly) Costa Rica, which are the only varieties where -illo exceeds 20% of the tokens (21% and 23%, respectively). I expected the result for Spain given that in descriptive work it is claimed that the suffix is more common there, but I did not expect Costa Rica to have a similar rate, let alone surpass it. Again, since these ratios are based on tokens, caution must be used, but it is one of the first surprising results.

6. Analysis

I now turn to the analysis proper, focused on statistical measures of productivity. To begin, the table below summarizes statistics for the entire data set. You can also find the same figures by variety in the statistics and analysis notebook.

| Diminutive | Tokens | Types | Hapax | P (×100) | P* |
| --- | --- | --- | --- | --- | --- |
| -illo | 233202 | 13157 | 6513 | 2.79286 | 0.121974 |
| -ito | 1195810 | 48930 | 26611 | 2.22535 | 0.498367 |
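As a worked check, P can be recomputed from the hapax and token counts in the table. The sketch below only re-derives figures already reported; the variable and function names are made up for this example. The recomputed proportions are about 0.028 for -illo and 0.022 for -ito, so the P values in the table correspond to the proportion multiplied by 100.

```python
# Counts and P* values copied from the summary table.
counts = {
    "-illo": {"tokens": 233202, "hapax": 6513, "p_star": 0.121974},
    "-ito": {"tokens": 1195810, "hapax": 26611, "p_star": 0.498367},
}


def category_conditioned_p(hapax: int, tokens: int) -> float:
    """P = hapax legomena with the pattern / tokens with the pattern."""
    return hapax / tokens


p_values = {
    suffix: category_conditioned_p(c["hapax"], c["tokens"])
    for suffix, c in counts.items()
}
# How many times larger the P* of -ito is than that of -illo.
p_star_ratio = counts["-ito"]["p_star"] / counts["-illo"]["p_star"]

for suffix, p in p_values.items():
    print(f"{suffix}: P = {p:.5f} (x100 = {100 * p:.3f})")
print(f"P* ratio (-ito / -illo) = {p_star_ratio:.2f}")
```

The P* ratio comes out at roughly four, which is the gap discussed in the analysis of potential productivity.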

In terms of realized productivity, as indicated by the type count, -ito is the larger category out of the two considered. This is visualized in Figure 4 below.

Figure 4

Turning to expanding productivity, as indicated by the hapax legomena count, -ito is attracting more members than -illo. This is visualized in Figure 5 below.

Figure 5

The third measure is potential productivity, as measured by the category-conditioned degree of productivity (P) or the hapax-conditioned degree of productivity (P*). The category-conditioned degree of productivity doesn't appear to fully capture the difference between these two suffixes in terms of occasionalisms, given that both suffixes appear to have around the same standing, with -illo showing a slight advantage. This might be due to the fact that -illo has notably fewer tokens, so the comparison is not completely fair. By contrast, the hapax-conditioned degree of productivity shows a far more noticeable difference, with -ito having a score that is about four times larger than that of -illo.

Finally, Figures 6 and 7 below illustrate the category-conditioned and hapax-conditioned degree of productivity by variety. Because each variety is compared only with itself, the two bars within a variety are directly comparable; i.e., they are not affected by imbalances in corpus size across varieties.

Figure 6

Figure 7

Within varieties, the numbers appear to follow the same trend as those of the master data frame. A few differences are worth noting, however. Whereas P in the master data set shows both suffixes in similar standing, the differences here are larger in favor of -illo in some cases, particularly in countries such as Peru, where it almost doubles -ito. For P*, however, -ito remains the prevailing suffix by a wide margin in all varieties, although differences vary by country.

To summarize, and in response to my research questions:

  1. The results of my analysis indicate that, in terms of distribution, -ito is the larger morphological category as shown by the token, type, and hapax legomena counts, not only at the cross-dialectal level but also within each country. I believe the results are important because, to the best of my knowledge, there aren't numbers to back claims that have been made in the literature about the prevalence of each suffix, and because the same model can be replicated for other competing morphological patterns.
  2. When turning to statistical measures of productivity, a surprising result is that the category-conditioned degree of productivity shows both suffixes in similar standing. A caveat of this measure is its sensitivity to corpus size, which is important to consider in this project given that the subsets by variety vary widely in size, with Spain's being by far the largest. Given the uniformity of the results when using the hapax-conditioned degree of productivity, I would argue that it is a better gauge of the synchronic productivity of a morphological pattern, especially when there are imbalances in the data.
  3. Lastly and of particular importance for morphological theory, the results show that although the by-variety trends generally mirror those seen cross-dialectally, there are exceptions noticeable upon closer inspection. In Spain and Costa Rica, for instance, both raw counts and statistical measures indicate a higher prevalence of -illo (unexpected for the latter). Productivity hence might be affected not only by system-internal factors but also by system-external factors. In a future iteration of this project, I would like to examine the differences using regression models (mixed-effects logistic regressions, for instance) and, on the more theoretical side, explore what diachronic research has to say regarding the evolution of these suffixes in general and, in particular, in the varieties where they appear to go against cross-dialectal trends.

References

Baayen, H. (2009). Corpus linguistics in morphology: morphological productivity. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics. An international handbook (pp. 900-919). Berlin: Mouton De Gruyter.

Davies, M. (2016). Corpus del español [online corpus]. Retrieved from https://www.corpusdelespanol.org (28 January, 2020).

Haspelmath, M., & Sims, A. D. (2010). Understanding morphology (2nd ed.). London, UK: Hachette.