Big data is watching you. And you might want to get used to it, because its reach is getting wider.
These days, it’s common knowledge that governments, security agencies and even social media companies are monitoring people’s behaviour on the internet.
The scale of this digital surveillance was revealed when Edward Snowden leaked classified information from the NSA in mid-2013. The thousands of documents informed the general public that global surveillance programs were being conducted by the NSA, along with other Five Eyes nations.
The sharing of information across international borders is nothing new. The Five Eyes intelligence alliance was established under the UKUSA Agreement back in 1946. The alliance comprises the USA, UK, Canada, New Zealand and Australia.
However, what has changed is technology. With the coming of the internet and people’s increasing reliance upon it, the amount of data being obtained and stored about individuals is growing. And no one is quite certain about the implications of leaving these data trails behind you.
The concept of modern surveillance began with Jeremy Bentham’s 18th century panopticon: a prison design that allows authorities to control inmates’ behaviour, as detainees are aware they may be monitored from a central watch tower at any given moment.
In the 20th century, CCTV cameras emerged as the most overt form of surveillance. When people are aware of the presence of these cameras, they’re more likely to act in accordance with societal norms, as authorities may be watching or recording them.
But things have changed with the advent of cyberspace. Even though people are aware of digital surveillance, they put it out of their minds much of the time and carry on as normal.
Your life’s in a databank
In April this year, the complete version of the Australian government’s mandatory metadata regime came into effect. Telcos and ISPs are now required to store the metadata of their customers – including the time and date of calls, emails, text messages and internet sessions – for a period of two years.
Currently, warrantless access to this data is reserved for 21 law enforcement agencies led by ASIO – but access can be, and has been, granted to many other organisations upon request.
And the government wants more. Last month, Malcolm Turnbull announced proposed new laws that will require social media and technology companies like Facebook and Google to allow Australian security agencies access to people’s encrypted messages.
While this is all being done on the pretext of terrorism, the moves pose a very real threat to the civil liberties of all citizens.
In March this year, Sydney’s Inner West council replaced locals’ municipal bins with new ones fitted with radio frequency identification devices. The council did little to make residents aware of this.
These devices only collect a minimal amount of data, but they’re part of what’s called the internet-of-things, which is an increasing number of everyday objects – such as cars and household appliances – connected to the internet.
And there are growing concerns that linking all this data together could pose significant dangers to individuals.
David Vaile is the co-convenor of the Cyberspace Law and Policy Community at UNSW. Sydney Criminal Lawyers® spoke to Mr Vaile about the nation’s metadata regime, the current assault on encryption, and the dangers of combining policing methods with machine learning.
Firstly, David, how pervasive is the digital surveillance being carried out by the government in this country? And should Australians be concerned?
It’s potentially very pervasive. We’ve seen the metadata retention laws introduced. Although the term metadata is not in the legislation, that’s the popular term for it. The legislation requires two years’ retention of a very large array of metadata, under the justification that there’s a difference between content and metadata, or telecommunications traffic information.
That’s based on the traditional telephone interception observation that the metadata is potentially interesting and significant. You look at the TV series The Wire and all the effort that’s expended around that and the significance of that information. But it tells you way less than what is in the sound waves: the substance of the conversation, being able to identify the actual speaker and that sort of thing.
It was justified as somehow different from a warrantless mass surveillance scheme, but in fact, that’s what it is.
It’s similar to what you have in the US and in Britain. Under the Five Eyes intelligence community, there’s also the question in Australia of which other foreign governments and agencies may be able to do this sort of thing as well. You’d have to imagine that there are very cooperative arrangements between the Five Eyes countries.
In America, you’ve got the Foreign Intelligence Surveillance Act that’s directed toward the rest of the world. And what the Snowden materials revealed was that in the US they really had a very relaxed interpretation of the foreigner rule. If there was any link, or likely link, with someone outside of the country, the rule was relaxed, and those non-US persons essentially had no rights.
It turns out that the British treated Americans as non-UK citizens. With the British law the legal protections are on a national basis, rather than an international or multinational basis. They were able to expose US citizens to the sort of surveillance that the US entities couldn’t do themselves.
There may be programs the Australian government has in place that require a warrant – for instance, for the actual interception of content – that some other participants can bypass. So the query is whether that goes around in a big network, or it can be used to break those national protections.
So you’ve got the telecommunications metadata retention that’s done in this country by requiring telcos to retain it. Whereas in the US and other countries, government entities were doing the retaining. Here it’s sold as a benefit that it is at arm’s length from the government. But the fact is, in many cases you don’t need a warrant to access it, so it effectively just means that consumers themselves pay for it, as the cost has been imposed on the telcos, rather than it being a government cost.
There’s the capacity for traditional phone taps and content based interception of communications. That’s quite broad. It has somewhat greater constraints around it. In some cases warrants are required. Although, one of the ongoing legal questions in this area is the concept of the general warrant.
The US Fourth Amendment says you’ve got to specify the time, the place and the effects of a search warrant. You can’t say, like the King used to in some European countries, “I pass a warrant that all my troops can look for anything they like at any time and anywhere.”
That was called a general warrant. It’s of the same nature as a search warrant, but broader and vaguer. And in the Fourth Amendment it was one of the constitutional privacy protections. One of the controversies in the US was that even where they did need a warrant, the warrants were very broad and general in terms.
Rather than saying there’s a suspicion of these people and there’s a probable cause, therefore you’re welcome to tap their phones and their emails, it was covered in a very broad sort of way.
Last month, Turnbull announced the government is proposing new laws that will require companies to allow security agencies access to people’s encrypted messages.
What are the implications of granting access to encrypted social media information? And how could these laws be misused?
It’s an ongoing struggle against encryption. I had the pleasure of going to a discussion with Glenn Greenwald last Monday. One of the things that he pointed out was that, prior to the global publicity about the scope of the mass surveillance, which was in some cases illegal and in some cases legal, there was already great cooperation between the big players, including Google and Facebook. They didn’t have much encryption then and they were facilitating access into what they had.
After the revelation of the scale of it, questions about the legality and the justification arose. In particular when Google discovered even though they cooperated very broadly with legal requests and search warrants, the NSA had hacked into the private fibre optic line between two Google data centres. The engineers were shocked by that. Because they thought, “How far do we have to go? What level of cooperation do we have to give, before they won’t go around that?”
That sparked a public relations problem for those big players that were named in the documents that were leaked. And it drove a wedge into what had been more or less open access for, and close cooperation with, government agencies.
After that Google and other entities started using encryption much more broadly. One of the reasons encryption is an issue now is that it’s a reaction to the very broad, poorly regulated access that they had before the middle of 2013.
There is another reason why encryption is now widely used. When you go to a website the old-fashioned protocol was ‘http,’ now almost everything is ‘https.’ And that requires a secure encrypted connection on the fly to the web server, so that you don’t get a man-in-the-middle attack. That’s where someone puts a relay between you and the server.
With man-in-the-middle, you can track that information and work out what’s going from your web browser to the server, but you can also change it. That sort of mechanism is used for the proliferation of internet and IT security threats: the hacking, the nation-state attacks and ransomware.
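As an illustrative aside (not part of the interview), the protection described here can be sketched with Python’s standard `ssl` module. An HTTPS client resists a relay in the middle only if it checks the server’s certificate and hostname; the helper names `strict_tls_context` and `is_mitm_resistant` are hypothetical, chosen for this sketch.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Return a client-side TLS context that verifies the server's identity."""
    # create_default_context() loads the system's trusted CA certificates and
    # turns on certificate verification and hostname checking.
    return ssl.create_default_context()


def is_mitm_resistant(context: ssl.SSLContext) -> bool:
    """True if the context rejects unverified certificates and wrong hostnames."""
    return context.verify_mode == ssl.CERT_REQUIRED and context.check_hostname


print(is_mitm_resistant(strict_tls_context()))
```

A client that switches these checks off still encrypts its traffic, but it can no longer tell the real server from a relay, which is the gap a man-in-the-middle exploits.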
It’s turned out that the scope, the danger and the unstoppability of the increasing internet security threats have also pushed the effectiveness of existing protections to the brink, and that has prompted encryption to be more widely used.
In the same way, the wide public awareness after the Snowden revelations of the scale of the very poorly regulated mass government surveillance across the planet triggered the interest in encryption.
In recent years, the constant data breaches and perimeter security breaches – and also the tendency of national security agencies to want to break IT security, rather than protect it – have provided a second strong reason for widespread usage of encryption.
Now they say baddies and terrorists use encryption. In a sense that’s true. But the thing is banking, e-commerce and financial services use encryption. Ordinary secure communications on the internet use encryption. So you’ve now got a very controversial move with vague justifications and explanations of why it’s actually needed.
The reason it’s now a controversy is that it not only represents a potential intrusion on privacy and the confidentiality of communications, but is also a potential fundamental threat to IT security, all sorts of e-commerce, and the fundamental resilience of the internet against the waves of ongoing attacks.
Sydney’s Inner West council provided locals with bins that actually had radio frequency identification devices (RFID) underneath the lids. The devices can only collect a minimal amount of data. But they’re part of the internet-of-things (IoT).
What are the implications of having all these devices collecting data about a person?
There are two aspects of the internet-of-things. One is the massive increase in the volume of data. It feeds into what’s known as big data, which is the idea that we don’t have data collected for an individual purpose in controlled areas, but we’ve got a data lake.
It’s essentially an omnivorous urge to collect it all, which is the same motive behind the NSA’s overreach. In the end the US Congress wound that back significantly. They discovered it didn’t work well. When you have too much information, you have the needle in the haystack problem. And in fact, it makes security worse.
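As an illustrative aside (not part of the interview), one common way to read the needle-in-the-haystack point is as base-rate arithmetic: when genuine targets are rare, even an accurate screening tool flags far more innocent people than real ones. The sketch below uses made-up numbers, and the function name `flagged_breakdown` and its parameters are hypothetical.

```python
def flagged_breakdown(population, true_targets, hit_rate, false_alarm_rate):
    """Split the people a screening tool flags into real hits and false alarms.

    All inputs are illustrative assumptions, not figures from any real program.
    """
    true_hits = true_targets * hit_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    # Precision: the share of flagged people who are genuine targets.
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# Hypothetical scenario: 24 million people, 1,000 genuine targets,
# a tool that catches 99% of targets and wrongly flags 1% of everyone else.
hits, alarms, precision = flagged_breakdown(24_000_000, 1_000, 0.99, 0.01)
print(f"real hits: {hits:.0f}, false alarms: {alarms:.0f}, precision: {precision:.2%}")
```

Under these assumed numbers, fewer than one in 200 flagged people is a genuine target, so analysts spend most of their effort clearing false alarms.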
The claim of the big data industries is that you need their new tools. The RFID chips on the bins are just the tip of the iceberg. The promoters of big data and of the internet-of-things are trying to increase the size of those data sets.
You then have what’s known as predictive analytics or machine learning or artificial intelligence methods, which are often very unreliable and discriminatory. They can use dirty, incomplete and inaccurate data.
They started off pitching ads to people. They’re probably OK for deciding what sort of ads you get. But, if you have insurance issues, or the council is treating you differently, and governments and insurers get access to that information, you’re potentially in trouble.
There’s also another trend called open data, where governments like to take all sorts of data. Some of it is not really risky, but they also take a lot of what was personal information and very lightly de-identify it. And they put that out in the open. It means that from a relatively innocuous data search you might get a bin cross-matched with big data analytics using an open data search.
The implications or the potential risk that’s projected onto you from when your bin was collected can be much more significant than it was 10 years ago, when none of this stuff existed.
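As an illustrative aside (not part of the interview), the cross-matching risk can be sketched as a simple linkage: a release with names removed is joined to a second, named data set on the attributes both share. Every record, field name and the `reidentify` helper below is hypothetical.

```python
from collections import defaultdict

# Hypothetical "lightly de-identified" release: names removed, other fields kept.
released_bin_records = [
    {"postcode": "2042", "street": "Example St", "dwelling": "house", "collections_missed": 7},
    {"postcode": "2042", "street": "Example St", "dwelling": "unit", "collections_missed": 0},
    {"postcode": "2049", "street": "Sample Rd", "dwelling": "house", "collections_missed": 2},
]

# Hypothetical named data set that happens to share the same fields.
public_register = [
    {"name": "Resident A", "postcode": "2042", "street": "Example St", "dwelling": "house"},
    {"name": "Resident B", "postcode": "2042", "street": "Example St", "dwelling": "unit"},
    {"name": "Resident C", "postcode": "2042", "street": "Example St", "dwelling": "unit"},
    {"name": "Resident D", "postcode": "2049", "street": "Sample Rd", "dwelling": "house"},
]

QUASI_IDENTIFIERS = ("postcode", "street", "dwelling")


def reidentify(released, register, keys=QUASI_IDENTIFIERS):
    """Attach a name to each released record that matches exactly one person."""
    index = defaultdict(list)
    for person in register:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the record is re-identified
            matches[candidates[0]] = record
    return matches


for name, record in reidentify(released_bin_records, public_register).items():
    print(name, "->", record["collections_missed"], "missed collections")
```

In this toy data two of the three records link back to a single named person; only the record shared by two residents stays ambiguous.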
The other side of it is the security side. The internet-of-things is embedded firmly in crappy, open-sourced software that typically never gets updated, unlike your mobile phone or computer, which are constantly discovered to be prone to new attacks and vulnerabilities. And they’re constantly fixed. You’re constantly harassed to make sure you’ve got the latest version.
The business model of the internet-of-things doesn’t involve fixing it. So it means, as well as proliferating the amount of data they collect and distribute, they are also likely to be hackable.
Some of the recent denial-of-service attacks have been of security cameras and other sorts of IoT devices that have been harnessed by infectious malware and used to attack particular services.
Not only are they leaking data about you in ways that have implications that are very hard to appreciate now, they’re also possibly being hijacked by criminal or foreign state actors trying to do who knows what.
The big question about the internet-of-things is your personal information. From a surveillance perspective and also an IT security perspective it’s really quite dangerous. But the proponents don’t want to engage in that sort of discussion. They don’t want to look at the risk. And they don’t want to look at the question of re-identification.
I’ve engaged in discussions or research with state and federal governments here and a couple of the governments of other countries. There is a dim awareness that some of these things might be a risk in the future, but they don’t show any responsibility for it. So they typically don’t audit. They don’t check. They don’t follow through or investigate, especially when it goes offshore. A lot of it goes into data havens in other countries.
From my perspective it’s a very poorly regulated, poorly governed and significant risk profile area. I’m very much a critic and a sceptic.
The University of Sydney’s Centre for Translational Data Science has been conducting research that utilises historical crime data, machine learning and applied statistical modelling with the hope of being able to predict future crime levels in certain areas.
How is big data being applied to law enforcement? And what are the dangers once mathematical algorithms are combined with policing?
My colleagues Lyria Bennett Moses and Janet Chan have written on this as part of the Data Decisions Cooperative Research Centre.
One of the problems is that a lot of the big data tools are derived from big data processing. They come from the Googles and the Facebooks that are mostly advertising entities. They don’t really care if they get things a little bit wrong. So they’re very tolerant of error.
When you’re making significant observations or conclusions about an individual in the area of traditional administrative law or criminal law, then those tools are very suspect, precisely because they are so tolerant of sloppy, dirty data.
They’re starting off from more traditional analytic tools and with people who are very sophisticated with methodologies and statistical inference. And also in identifying the range of ways that correlations can be misleading and irrelevant. They might be less problematic.
But, as far as I can tell, the wave of enthusiasm that is developing from the big data space tends to swamp what was seen as the smaller more old-fashioned and more careful methods. There’s a fundamental threat from that new culture that is much less concerned about the problems and the reliability of the data.
Then you’ve got other sorts of questions about the data itself being skewed. You’ve already got police who more heavily police people of a certain race or a certain community. And a lot of the discussion about this has happened in the US, where the colour of your skin is a greater determinant – more so than your behaviours – of whether you get picked up by the police or shot or arrested.
So if you’ve got data sets that are derived from a system that is not neutral or statistically fairly sampled, and if the data reflects the existing policing practices that may have discriminatory assumptions embedded in them, and you feed that into a machine learning system, then they’ll say, “Oh well, you’re more likely to get trouble in Redfern.”
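As an illustrative aside (not part of the interview), the skew described here can be sketched with a deliberately simple model. Two areas have the same underlying offence rate but unequal patrol coverage, recorded incidents follow the patrols, and each round’s patrols are allocated from the previous round’s records. The function `simulate_recorded_crime`, the area names and all numbers are hypothetical.

```python
def simulate_recorded_crime(true_rates, patrol_shares, rounds=5, detections_per_round=1000):
    """Toy model: recorded crime depends on where police look, not only on what happens."""
    shares = dict(patrol_shares)
    history = []
    for _ in range(rounds):
        # An incident is only recorded if it occurs AND a patrol is there to see it.
        weights = {area: true_rates[area] * shares[area] for area in shares}
        total = sum(weights.values())
        recorded = {area: detections_per_round * w / total for area, w in weights.items()}
        history.append(recorded)
        # "Predictive" step: next round's patrols follow this round's recorded crime.
        shares = {area: recorded[area] / detections_per_round for area in recorded}
    return history


history = simulate_recorded_crime(
    true_rates={"Area A": 0.5, "Area B": 0.5},     # identical underlying behaviour
    patrol_shares={"Area A": 0.7, "Area B": 0.3},  # unequal starting attention
)
for round_number, recorded in enumerate(history, start=1):
    print(round_number, {area: round(count) for area, count in recorded.items()})
```

Even with identical underlying rates, the more heavily patrolled area keeps producing more records in every round, so a system trained on those records would keep treating the original allocation as confirmed.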
And if you’re talking about domestic violence the actual incidents of domestic violence are much more broadly spread. Using the available data, it’s very difficult to understand all of that.
Presumably, people with experience of criminology may find some ways of countering that. But the other problem is that it’s very difficult once you start using these machine learning algorithms and other sorts of intelligence tools – and the predictive analytics are often based on these – to present the data in an understandable way.
You can imagine them saying, “You can’t cope with the truth. You can’t cope with the volume of data that is leading to this. And you can’t cope with the complexity of the associational algorithms, or the nature of the associations. And we can’t reduce it to anything more comprehensible by humans. It is literally too big a data critique for you to understand.”
It’s not based on traditional things like causation or personal responsibility or direct connection. It’s much more based on associations and patterns.
In most cases these haven’t been designed to be capable of generating reports and outputs that enable a sceptical reviewer to have potential to find false and misleading things. As a field, it’s actually problematic.
In the sense of making risk assessments of individuals, this is even more so. Because in many cases it’s very hard to identify where it came from and whether it was a reasonable attribution. Because in many cases it hasn’t been designed to be expressed in ways that are open to appeal, review and correction.
But also, in many cases, the individual doesn’t realise that they’ve been profiled. They’ve got a greater risk of adverse criminal or government actions taken against them, because of the combination of the analytics of that crime more generally and their own sort of profile.
It’s very easy to get on a list and very hard to get off it.
David, thanks for taking the time out to have this chat with us today.