Researching the public web

by Mike Thelwall,
Statistical Cybermetrics Research Group,
University of Wolverhampton, UK.

I would like to make the case that academic researchers should not have any restrictions placed on the kinds of (legal) data that they investigate on the public web. In particular, researchers should be allowed to investigate personal information on the public web, such as social network site profiles, without considerations of informed consent (Wilkinson & Thelwall, in press). The main restriction that should be considered is the need to protect identities when publishing results.

The background to this claim is that there is much research that primarily looks at the content of part of the public web: blogs, social network sites (SNSs), other Web 2.0 sites or more traditional web publishing. This research varies from small-scale qualitative studies, such as analysing the ethical orientations of a few diary bloggers (Hookway, 2008), to large-scale quantitative investigations such as maps of the interlinking of hundreds of bloggers (Adamic & Glance, 2005) or even automatic text analyses of tens of millions of SNS members (Kramer, 2010) or bloggers (Dodds & Danforth, in press; Gruhl, Guha, Liben-Nowell, & Tomkins, 2004). Whilst for humanities research, the web can be seen as a collection of documents and in science the web may be regarded as public data that can be fairly researched, a deeply ingrained social science ethical procedure for research using human subjects is the need to gain informed consent, not just for the right to collect the data but also for its uses (e.g., Heath, Brooks, Cleaver, & Ireland, 2009). This consent does not seem to have been gained for much web research, including probably the vast majority of quantitative research, yet some ethics committees block web research if it is not possible or practical to obtain consent from the people whose data is studied.

A simple but strong argument for researching published information on the public web without consent is that the object investigated is the publication and not the person. Therefore, human subject standards do not apply (Bassett & O’Riordan, 2002; Eynon, Schroeder, & Fry, 2009; Ess & AoIR-Ethics-Working-Committee, 2002; Hookway, 2008; White, 2002).

I would next like to make the case that researching people’s personal information does not violate their privacy if the information researched is on the public web.

From the huge literature on privacy definitions, laws and ethics I would like to pick out Moor’s distinction between normative and natural privacy (Moor, 2004). Normative privacy covers situations in which it is reasonable to expect others to protect privacy, whether by law or custom. A person’s home is a typical normatively private situation but this may also extend to personal information reported to a bank or government department. In contrast, natural privacy covers cases when a person might expect to be private, despite the lack of others to defend this situation. For instance, somebody sunbathing on a public beach on a remote island might expect privacy simply because they believe that nobody else is likely to come along. The difference is important because of the implications of breaching privacy in both cases. In a normatively private situation it would be reasonable to describe privacy as violated if it is breached. Repercussions should be expected following such a privacy violation. In contrast, breaching a naturally private situation is not a violation but an accident from the perspective of the person involved. For example, should a hiking party tramp across the sunbather’s remote beach, this may be an unpleasant surprise but there is no reasonable cause for complaint: privacy has been breached but not violated (Moor, 2004).

The public web is a situation that someone posting information may believe to be private but could not reasonably believe to be protected. Thus even if someone posts personal information to a public SNS profile or blog, expecting it to be seen only by their friends, they still do not have a reasonable cause for complaint about their privacy being breached when others find the information, even if an employer sacks them as a result. Hence the public web is a naturally private situation and there is no need to protect the public from breaches of perceived privacy by researchers analysing their published content.

A position that is somewhat similar to Moor’s but gives an additional useful perspective is Nissenbaum’s (2009) theory of contextual integrity, which is developed from a combined legal and ethical perspective. She argues that context rather than law or precise definitions of privacy is important when judging an issue of potential privacy violation or predicting people’s response to a new action. Previous cases exist when there has been significant resistance to the use of public web data or re-use of private web data. An important example is Facebook’s introduction of the news feed feature that broadcasts members’ actions to their friends, actions that friends would otherwise have to actively seek out (Nissenbaum, 2009). This was a breach of contextual integrity in the sense that information available in one context (browsing) was delivered in a different and more intrusive context (news feeds). A similar argument could be made for processing public data. If it is intended for use in one context (e.g., being read by friends) but is used for another (academic research) then this breaks contextual integrity and risks a reaction from the individuals involved. A defence against this claim, however, is that the new context is also a normal one. Academic researchers can use the fact that commercial processing of data on the public web occurs on a large scale and is intrusive because it is used to target advertising directly (Zimmer, 2008) or indirectly by analysing market segments (Gandy, 1993). In consequence, academic research is less breaching a private context than quietly walking through a busy street of commercial analysers. Perhaps most people do not realise the extent to which they are subject to commercial surveillance via the web, but its existence means that it is unreasonable to complain about academic research on the grounds of breach of contextual integrity.

In summary, the three points above make the case that researching the public web should not be subjected to ethical scrutiny for privacy concerns. Whilst this is probably uncontroversial in science and perhaps also the humanities, it seems to be an important statement to make in some social sciences.


Adamic, L., & Glance, N. (2005). The political blogosphere and the 2004 US election: Divided they blog. WWW2005 blog workshop, Retrieved May 5, 2006 from:

Bassett, E. H., & O’Riordan, K. (2002). Ethics of Internet research: Contesting the human subjects research model. Ethics and Information Technology, 4(3), 233-247.

Dodds, P. S., & Danforth, C. M. (in press). Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies.

Eynon, R., Schroeder, R., & Fry, J. (2009). New techniques in online research: Challenges for research ethics. 21st Century Society, 4(2), 187-199.

Ess, C., & AoIR-Ethics-Working-Committee. (2002). Ethical decision-making and Internet research. Recommendations from the aoir ethics working committee. Retrieved April 17, 2008 from:

Gandy, O. (1993). The panoptic sort: A political economy of personal information. Boulder, CO: Westview Press.

Gruhl, D., Guha, R., Liben-Nowell, D., & Tomkins, A. (2004). Information diffusion through Blogspace. Paper presented at the WWW2004, New York, Retrieved July 5, 2010 from:

Heath, S., Brooks, R., Cleaver, E., & Ireland, E. (2009). Researching young people’s lives. Thousand Oaks, CA: Sage.

Hookway, N. (2008). Entering the ‘blogosphere’: some strategies for using blogs in social research. Qualitative Research, 8(1), 91-113.

Kramer, A. D. I. (2010). An unobtrusive behavioral model of “Gross National Happiness”. Proceedings of CHI 2010, 287-290.

Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.

Nissenbaum, H. (2009). Privacy in context: Technology, policy and the integrity of social life. Stanford, CA: Stanford University Press.

White, M. (2002). Representations or people? Ethics and Information Technology, 4(3), 249-266.

Wilkinson, D., & Thelwall, M. (in press). Researching personal information on the public web: Methods and ethics. Social Science Computer Review.

Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.