Content analysis

Content analysis is the study of documents and communication artifacts, which are defined as texts. Examples of texts include photographs, speeches, and essays. Social scientists employ content analysis as a method of examining patterns in communication in a replicable and systematic manner. One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers.

Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of texts or artifacts which are assigned labels (sometimes called codes) to indicate the presence of interesting, meaningful pieces of content. By systematically labeling the content of a set of texts, researchers can analyse patterns of content quantitatively using statistical methods, or use qualitative methods to analyse meanings of content within texts.

Computers are increasingly used in content analysis to automate the labeling (or coding) of documents. Simple computational techniques can provide descriptive data such as word frequencies and document lengths. Machine learning classifiers can greatly increase the number of texts that can be labeled, but the scientific utility of doing so is a matter of debate. Further, numerous computer-aided text analysis (CATA) programs are available that analyze text for predetermined linguistic, semantic, and psychological characteristics.
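As an illustration of the simple descriptive techniques mentioned above, the following minimal Python sketch computes document lengths and word frequencies. The corpus, the function names `tokenize` and `describe`, and the tokenization rule are all assumptions made for this example; real studies usually need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def describe(documents):
    """Return per-document lengths (in tokens) and corpus-wide word frequencies."""
    lengths = {}
    frequencies = Counter()
    for name, text in documents.items():
        tokens = tokenize(text)
        lengths[name] = len(tokens)
        frequencies.update(tokens)
    return lengths, frequencies

# Hypothetical two-document corpus, invented for illustration.
corpus = {
    "speech_a": "The economy is strong. The economy is growing.",
    "speech_b": "Taxes are too high, and the economy suffers.",
}

lengths, frequencies = describe(corpus)
print(lengths)
print(frequencies.most_common(3))
```

Counts of this kind describe only manifest features of the texts; they do not by themselves assign the labels that a coding scheme requires.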

Qualitative and quantitative content analysis

Quantitative content analysis emphasizes frequency counts and the statistical analysis of these coded frequencies. It begins with a framed hypothesis, and the coding is decided on before the analysis begins. The coding categories are strictly relevant to the researcher's hypothesis. Quantitative analysis also takes a deductive approach. Examples of content-analytical variables and constructs can be found in the open-access database DOCA. This database compiles, systematizes, and evaluates relevant content-analytical variables of communication and political science research areas and topics.

Siegfried Kracauer provides a critique of quantitative analysis, asserting that it oversimplifies complex communications in order to be more reliable. Qualitative analysis, on the other hand, deals with the intricacies of latent interpretations, whereas quantitative analysis focuses on manifest meanings. He also acknowledges an "overlap" of qualitative and quantitative content analysis. Answers to open-ended questions, newspaper articles, political party manifestos, medical records or systematic observations in experiments can all be subject to systematic analysis of textual data.

When the contents of communication are available in the form of machine-readable texts, the input is analyzed for frequencies and coded into categories from which inferences are built.
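One simple way to code machine-readable text into categories is a dictionary-based approach, sketched below in Python. The `CODEBOOK` categories, their indicator words, and the function name `code_text` are hypothetical and chosen only for illustration; this is one possible technique, not a method prescribed by the sources discussed here.

```python
import re

# Hypothetical coding dictionary: category label -> indicator words.
CODEBOOK = {
    "economy": {"economy", "jobs", "taxes", "inflation"},
    "security": {"crime", "police", "defense"},
}

def code_text(text, codebook=CODEBOOK):
    """Count how many tokens in the text match each category's indicator words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {category: 0 for category in codebook}
    for token in tokens:
        for category, indicators in codebook.items():
            if token in indicators:
                counts[category] += 1
    return counts

# Invented sample sentence.
sample = "Taxes and inflation hurt jobs, while crime is falling."
print(code_text(sample))
```

A dictionary of this kind captures only manifest word use; it cannot tell, for instance, whether a text reports crime as rising or as falling.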

Computer-assisted analysis can help with large electronic data sets by saving time and eliminating the need for multiple human coders to establish inter-coder reliability. However, human coders can still be employed for content analysis, as they are often better able to pick out nuanced and latent meanings in text. One study found that human coders were able to evaluate a broader range and make inferences based on latent meanings.

Reliability and validity

Robert Weber notes: "To make valid inferences from the text, it is important that the classification procedure be reliable in the sense of being consistent: Different people should code the same text in the same way". Validity, inter-coder reliability and intra-coder reliability have been the subject of intense methodological research for many years. Lacy and Riffe identify the measurement of inter-coder reliability as a strength of quantitative content analysis, arguing that, if content analysts do not measure inter-coder reliability, their data are no more reliable than the subjective impressions of a single reader.

According to today's reporting standards, quantitative content analyses should be published with complete codebooks, and the appropriate inter-coder or inter-rater reliability coefficients, based on empirical pre-tests, should be reported for all variables or measures in the codebook. Furthermore, the validity of all variables or measures in the codebook must be ensured. This can be achieved through the use of established measures that have proven their validity in earlier studies. The content validity of the measures can also be checked by experts from the field, who scrutinize and then approve or correct the coding instructions, definitions and examples in the codebook.
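As a worked illustration of an inter-coder reliability coefficient, the sketch below computes Cohen's kappa for two coders who labeled the same units on a nominal scale. Kappa is only one of several coefficients in use, and its choice here is an assumption of the example; the labels and the function name `cohens_kappa` are likewise invented. The coefficient is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each coder's label frequencies.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal labels on the same units."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of units")
    n = len(coder_a)
    # Observed agreement: share of units given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions per label.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label for every unit
    return (observed - expected) / (1 - expected)

# Hypothetical codings of ten units by two coders.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

In this invented example the coders agree on 8 of 10 units (p_o = 0.8), while chance agreement is p_e = 0.52, so kappa is 0.28 / 0.48, or about 0.583.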

Kinds of text

There are five types of texts in content analysis:

written text, such as books and papers
oral text, such as speech and theatrical performance
iconic text, such as drawings, paintings, and icons
audio-visual text, such as TV programs, movies, and videos
hypertexts, which are texts found on the Internet

History

Content analysis is research using the categorization and classification of speech, written text, interviews, images, or other forms of communication. In its beginnings, applied to the first newspapers at the end of the 19th century, analysis was done manually by measuring the number of columns given to a subject. The approach can also be traced back to a university student studying patterns in Shakespeare's literature in 1893.

Over the years, content analysis has been applied in a variety of fields. Hermeneutics and philology have long used content analysis to interpret sacred and profane texts and, in many cases, to attribute texts' authorship and authenticity.

In recent times, particularly with the advent of mass communication, content analysis has been increasingly used to analyze and understand media content and media logic in depth.

The political scientist Harold Lasswell formulated the core questions of content analysis in its early-to-mid 20th-century mainstream version: "Who says what, to whom, why, to what extent and with what effect?". The strong emphasis on a quantitative approach initiated by Lasswell was carried through by another "father" of content analysis, Bernard Berelson, who proposed a definition of content analysis that is emblematic of this point of view: "a research technique for the objective, systematic and quantitative description of the manifest content of communication".

Quantitative content analysis has enjoyed renewed popularity in recent years thanks to technological advances, being fruitfully applied in mass and personal communication research. Content analysis of textual big data produced by new media, particularly social media and mobile devices, has become popular. These approaches take a simplified view of language that ignores the complexity of semiosis, the process by which meaning is formed out of language. Quantitative content analysts have been criticized for limiting the scope of content analysis to simple counting, and for applying the measurement methodologies of the natural sciences without reflecting critically on their appropriateness to social science. Conversely, qualitative content analysts have been criticized for being insufficiently systematic and too impressionistic.

Latent and manifest content

Manifest content is readily understandable at its face value. Its meaning is direct. Latent content is not as overt, and requires interpretation to uncover the meaning or implication.

Uses

Holsti groups fifteen uses of content analysis into three basic categories:

make inferences about the antecedents of a communication
describe and make inferences about characteristics of a communication
make inferences about the effects of a communication.

He also places these uses into the context of the basic communication paradigm.

The following table shows fifteen uses of content analysis in terms of their general purpose, element of the communication paradigm to which they apply, and the general question they are intended to answer.

{| class="wikitable"
|+Uses of Content Analysis by Purpose, Communication Element, and Question
! Purpose
! Element
! Question
! Use
|-
| rowspan=2| Make inferences about the antecedents of communications
| align=center| Source
| align=center| Who?
|
* Answer questions of disputed authorship (authorship analysis)
|-
| align=center| Encoding process
| align=center| Why?
|
* Secure political & military intelligence
* Analyse traits of individuals
* Infer cultural aspects & change
* Provide legal & evaluative evidence
|-
| rowspan=3| Describe & make inferences about the characteristics of communications
| align=center| Channel
| align=center| How?
|
* Analyse techniques of persuasion
* Analyse style
|-
| align=center| Message
| align=center| What?
|
* Describe trends in communication content
* Relate known characteristics of sources to messages they produce
* Compare communication content to standards
|-
| align=center| Recipient
| align=center| To whom?
|
* Relate known characteristics of audiences to messages produced for them
* Describe patterns of communication
|-
| Make inferences about the consequences of communications
| align=center| Decoding process
| align=center| With what effect?
|
* Measure readability
* Analyse the flow of information
* Assess responses to communications
|-
| colspan=4| Note. Purpose, communication element, & question from Holsti; uses as adapted by Holsti.
|}

Content analysis attempts to describe quantifiably communications whose features are primarily categorical (usually limited to a nominal or ordinal scale). It does so via selected conceptual units (the unitization), which are assigned values (the categorization) for enumeration while intercoder reliability is monitored. If instead the target quantity is manifestly already directly measurable, typically on an interval or ratio scale and especially as a continuous physical quantity, then such targets are usually not listed among those needing the "subjective" selections and formulations of content analysis.

For example (from mixed research and clinical application), as medical images communicate diagnostic features to physicians, neuroimaging's stroke (infarct) volume scale called ASPECTS is unitized as 10 qualitatively delineated (unequal) brain regions in the middle cerebral artery territory, which it categorizes as being at least partly versus not at all infarcted in order to enumerate the latter, with published series often assessing intercoder reliability by Cohen's kappa. These operations (unitization, categorization, enumeration, and assessment of intercoder reliability) impose the uncredited form of content analysis onto an estimation of infarct extent, which instead is easily enough and more accurately measured as a volume directly on the images. ("Accuracy ... is the highest form of reliability.") The concomitant clinical assessment, however, by the National Institutes of Health Stroke Scale (NIHSS) or the modified Rankin Scale (mRS), retains the necessary form of content analysis.

Recognizing potential limits of content analysis across the contents of language and images alike, Klaus Krippendorff affirms that "comprehen[sion] ... may ... not conform at all to the process of classification and/or counting by which most content analyses proceed," suggesting that content analysis might materially distort a message.

Developing the initial coding scheme

The development of the initial coding scheme or approach to coding depends on the particular content analysis approach selected. In a directed content analysis, scholars draft a preliminary coding scheme from pre-existing theory or assumptions. In the conventional content analysis approach, by contrast, the initial coding scheme is developed from the data.

Conventional process of coding

With either approach above, researchers may immerse themselves in the data to obtain an overall picture. A consistent and clear unit of coding is vital, with the choices ranging from a single word to several paragraphs and from texts to iconic symbols. Lastly, researchers construct the relationships between codes by sorting them into specific categories or themes.
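To make the choice of coding unit concrete, the following Python sketch splits the same text into units at three levels of granularity. The function name `units`, the sample text, and the splitting rules (blank lines for paragraphs, sentence-final punctuation for sentences) are simplifying assumptions for illustration only.

```python
import re

def units(text, level):
    """Split text into coding units at the level 'word', 'sentence' or 'paragraph'."""
    if level == "paragraph":
        # Paragraphs are assumed to be separated by one or more blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if level == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if level == "word":
        return re.findall(r"\w+", text)
    raise ValueError(f"unknown level: {level}")

# Invented sample text with two paragraphs.
text = "Prices rose again. Voters are worried.\n\nThe minister promised relief."

for level in ("paragraph", "sentence", "word"):
    print(level, len(units(text, level)))
```

The same text yields 2, 3, or 10 units depending on the level chosen, which is why the unit needs to be fixed and documented before coding begins.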