Machine learning offers businesses significant new opportunities, but concerns about data breaches and security incidents mean corporations must spend as much effort securing cloud machine learning systems as they do deploying them. This primer is designed to educate technology leaders on the basic tools available to keep their data and algorithms safe in the cloud.
Introduction
Today, the vast majority of machine learning systems are deployed in a cloud environment. Not only are deployments hosted there, but model training, retraining, and fine-tuning are typically performed in the cloud as well, thanks to the ubiquity and affordability of cloud-based GPU machines offered through services such as AWS, Google Cloud, and Microsoft Azure. While the increasing use of cloud servers has vastly broadened access to affordable ML hosting and compute, it also opens up a number of vulnerabilities that malicious users can exploit to compromise user data and steal personally identifiable information. Companies seeking to leverage machine learning must therefore understand these risks and take steps to mitigate them when hosting and training ML models in the cloud.
Universal Security Concerns Applied to Machine Learning
The very nature of the cloud means that basic security best practices must be employed, whether machine learning is involved or not. The following list includes the basics that must be covered, as well as the nuances that should be considered for applications involving machine learning.
User Authentication and Access Permissions
The first step in protecting cloud systems is restricting access to servers so that only those with prior authorization can interface with the machine. This step applies generally and is not restricted to machine learning systems. Typically, restricting access is as simple as creating a proper security group (or equivalent security profile) that opens the pre-configured SSH port on each machine only to the IP addresses of approved users. This is easily done and is often a first step in launching an instance on, for example, the AWS EC2 service.
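As a concrete illustration, here is a minimal sketch using boto3 to restrict SSH access on an EC2 security group to a single approved IP address. The security group ID and CIDR below are placeholders, not values from any real deployment.

```python
import boto3

# Sketch: allow SSH (port 22) only from one approved IP address.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{
            "CidrIp": "203.0.113.10/32",  # approved user's IP only
            "Description": "SSH access for ML ops engineer",
        }],
    }],
)
```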
Each SSH connection to the server should also be authenticated with a valid private key (such as a .pem file). This key should be kept secret and distributed only to those with approved access, and anyone who does have the .pem file should take common-sense steps to protect it from unauthorized use. For larger development teams, IAM roles (or the equivalent on other cloud platforms) should be used to grant each user only the minimal level of permissions required to perform their role; no excess permissions should be granted to users who don't need them.
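To make the least-privilege idea concrete, here is a hedged sketch of an IAM policy that only allows read access to a single training-data bucket, created via boto3. The bucket and policy names are hypothetical placeholders.

```python
import json
import boto3

# Sketch: a read-only policy scoped to one training-data bucket.
iam = boto3.client("iam")

read_only_training_data = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-training-data",    # placeholder bucket
            "arn:aws:s3:::example-training-data/*",
        ],
    }],
}

iam.create_policy(
    PolicyName="MLTrainingDataReadOnly",
    PolicyDocument=json.dumps(read_only_training_data),
)
```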
Database Security
Since machine learning models require data to train, protection of the data source must also be considered – particularly if that data contains sensitive information. Usually, the best practice is to keep the database on a separate server from the machine learning model. The model can then pull data as needed through a middle-layer application rather than reading from and writing to the database directly. Databases are preferred over plain-text files, CSV files, and similar sources because they can be secured with credentials.
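The following is a minimal sketch of what such a middle-layer service might look like, assuming a small Flask API and a SQLite store; table and column names are illustrative, and a production service would also require authentication and TLS.

```python
import sqlite3
from flask import Flask, jsonify

# Sketch: the model host calls this API instead of connecting to the database directly.
app = Flask(__name__)

@app.route("/features/<int:record_id>")
def get_features(record_id):
    conn = sqlite3.connect("training.db")  # placeholder data store
    try:
        row = conn.execute(
            "SELECT feature_1, feature_2 FROM examples WHERE id = ?",
            (record_id,),  # parameterized query, never string-built
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"feature_1": row[0], "feature_2": row[1]})
```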
Legal Regulations and Compliance
Those who hold certain types of data are required by law to enact very specific security and access control protections. For example, providers who host healthcare data are bound by HIPAA. Similar regulations apply to data in the financial services, education, government, and energy industries. These additional legal requirements must be taken into consideration by those developing cloud machine learning systems centered around such data.
Another aspect to consider when working with sensitive data on cloud systems is where the cloud provider physically stores that data. Certain providers might not give you fine-grained control over where your data is hosted; they might place your database on a server located in Singapore or China. That may be acceptable, but hosting confidential data on an overseas server can cause compliance issues. It is therefore important to use a cloud provider that allows you to specify the region in which your servers are located. This has other benefits as well, such as reducing communication latency between different pieces of your application.
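As a small sketch of pinning data residency, the snippet below creates an S3 bucket in an explicitly chosen region using boto3; the region and bucket name are placeholders.

```python
import boto3

# Sketch: keep training data in a specific region for residency/compliance reasons.
s3 = boto3.client("s3", region_name="eu-central-1")

s3.create_bucket(
    Bucket="example-ml-training-data",  # placeholder bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)
```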
Redundancy
Cloud security involves not just protecting data from malicious users but ensuring that data does not get lost or corrupted. Since cloud servers can fail, it is important to regularly back up data and to host it on multiple machines, a concept called redundancy. That way, if one server goes down, another can step in to take its place. This is a common and well-accepted best practice in the world of distributed systems. Luckily, many cloud providers handle redundancy for you. However, it's still a good idea to regularly make backups of your data and store them in a second location.
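A simple backup step might look like the following sketch, which copies a nightly database dump to a bucket in a second region; the file path, bucket name, and regions are placeholders.

```python
import datetime
import boto3

# Sketch: ship a nightly database dump to a bucket in a different region.
backup_key = f"backups/db-{datetime.date.today().isoformat()}.sql.gz"

s3_secondary = boto3.client("s3", region_name="us-west-2")
s3_secondary.upload_file(
    "/var/backups/db.sql.gz",       # local dump produced by your backup job
    "example-ml-backups-west",      # placeholder bucket in the secondary region
    backup_key,
)
```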
Logs and Audit Trails
If a cloud ML system is compromised, you will want a record of exactly what went wrong, so you can close the security loopholes that were exploited and get the system back up and running quickly. One of the best ways to aid in this process is to implement robust logging within your application. Events such as a model being queried or accessed, data being pulled from the database, and the IP addresses of incoming requests should all be written to log files so that they can be reviewed and audited if necessary.
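A minimal sketch of such audit logging, using Python's standard logging module, is shown below; the field names and model name are illustrative.

```python
import logging

# Sketch: record who queried the model, when, and from where, for later audit.
logging.basicConfig(
    filename="ml_service_audit.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("ml_audit")

def log_model_query(client_ip, model_name, num_inputs):
    """Record a single prediction request in the audit trail."""
    logger.info("model_query ip=%s model=%s inputs=%d",
                client_ip, model_name, num_inputs)

log_model_query("203.0.113.10", "churn-classifier-v3", 32)
```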
The Machine Learning Toolkit for Security
Cloud providers offer developers an ever-expanding list of tools and approaches for securing applications. Here are two that are useful for most applications, followed by two that are specific to machine learning.
1. Containers, Docker, and Kubernetes
Cloud machine learning systems often host models in containers: lightweight, OS-level virtualization layers that package all of the code and libraries required to run a model. Containers are usually built with Docker or similar software and deployed using a container orchestration service such as Kubernetes. The Kubernetes documentation provides a great resource for securing containers and deployment clusters. Typical steps involve signing images, limiting OS privileges for container users, protecting and limiting network access to clusters, and implementing pod security policies.
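To illustrate one of these steps, here is a sketch using the official Kubernetes Python client that defines a model-serving pod with reduced OS privileges. The image, pod name, and namespace are placeholders, and cluster-level controls such as network policies and image signing are configured separately.

```python
from kubernetes import client, config

# Sketch: a model-serving container that drops unnecessary OS privileges.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="model-server"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="model-server",
                image="registry.example.com/model-server:1.0",  # placeholder image
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,              # never run as root
                    allow_privilege_escalation=False,  # block privilege escalation
                    read_only_root_filesystem=True,    # container cannot modify its filesystem
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-serving", body=pod)
```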
2. Firewalls
The most basic type of malicious attack on a machine learning system is a flooding, or denial-of-service, attack: malicious users mount anywhere from thousands to millions of requests against a cloud system to overwhelm the network or application and take it down. One of the best ways to protect against such an attack is a firewall, which rate-limits incoming requests and blacklists IPs that exhibit irregular behavior and are deemed malicious.
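The toy sketch below illustrates the core rate-limiting idea; in production this would be handled by a managed firewall or API gateway rather than application code, and the limits chosen here are arbitrary.

```python
import time
from collections import defaultdict, deque

# Toy rate limiter: at most MAX_REQUESTS_PER_MINUTE per IP, with a blacklist
# for IPs that repeatedly exceed the limit.
MAX_REQUESTS_PER_MINUTE = 100
STRIKES_BEFORE_BLACKLIST = 3

request_times = defaultdict(deque)  # ip -> timestamps of recent requests
strikes = defaultdict(int)
blacklist = set()

def allow_request(ip, now=None):
    """Return True if this request should be served, False if dropped."""
    now = now or time.time()
    if ip in blacklist:
        return False
    window = request_times[ip]
    while window and now - window[0] > 60:  # keep only the last minute
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        strikes[ip] += 1
        if strikes[ip] >= STRIKES_BEFORE_BLACKLIST:
            blacklist.add(ip)
        return False
    window.append(now)
    return True
```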
3. Federated Machine Learning
One way to address many of the aforementioned security risks is federated machine learning. In federated ML, models are trained and used in a decentralized manner, with data samples distributed across many machines and kept isolated from one another. Because the dataset is split across servers, it is much more difficult for an attacker to break in and access the entire database. Only the parameters of the machine learning model are shared across machines, never the data, and these parameters can be encrypted. The local data shards can also be encrypted and operated on directly in encrypted form using schemes such as homomorphic encryption.
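Here is a deliberately simplified sketch of the federated averaging pattern: each client computes an update on its own shard, and only parameters travel to the server. The "local training" step is a stand-in for a real gradient update, not an actual learning algorithm.

```python
import numpy as np

def local_update(weights, local_shard, lr=0.01):
    """Stand-in for one round of local training on a single client's data."""
    fake_gradient = np.mean(local_shard, axis=0)  # placeholder for a real gradient
    return weights - lr * fake_gradient

def federated_round(global_weights, client_shards):
    """Server averages client parameters; raw data never leaves each client."""
    client_weights = [local_update(global_weights, shard) for shard in client_shards]
    return np.mean(client_weights, axis=0)

rng = np.random.default_rng(0)
shards = [rng.normal(size=(50, 10)) for _ in range(5)]  # data stays on each client
weights = np.zeros(10)
for _ in range(10):
    weights = federated_round(weights, shards)
```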
4. Differential Privacy
Differential privacy refers to a set of methods for preventing individual user data from being recovered from a database via attacks on a learning algorithm. There are many ways to implement differential privacy, but they all rely on the principle that a statistical algorithm should be robust enough that its output is not fundamentally changed by the addition or removal of any single data point. By aggregating across many examples and injecting noise into the training process, differentially private algorithms can achieve much the same results as their non-secured counterparts while preventing the information of any single individual from being discovered.
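One of the simplest differentially private mechanisms is the Laplace mechanism, sketched below for a mean query: noise is calibrated to the maximum influence any single individual can have on the answer. The data and epsilon value are illustrative.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon):
    """Differentially private mean of values clipped to [lower, upper]."""
    values = np.clip(values, lower, upper)
    n = len(values)
    sensitivity = (upper - lower) / n  # max change one individual can cause
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.random.randint(18, 90, size=10_000)
print(private_mean(ages, lower=18, upper=90, epsilon=0.5))
```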
Security Threats Specific to Machine Learning
Multiple types of attacks can be mounted directly against machine learning models, far too many to cover in a single article. The following are a few of the major classes of attack, with examples of the ways in which models are typically targeted.
Adversarial Example Attacks
Machine learning models are often sensitive to small perturbations in their inputs. This is particularly true of image classification and related models. One way attackers can exploit this is to take a valid input to the system and add a small, carefully crafted perturbation to it.
For example, an attacker with access to a self-driving car's video feed could take a frame showing an upcoming stop sign and inject a small perturbation into the image. The change would be imperceptible to the human eye – for all intents and purposes, it would still look like a normal stop sign to an observer. However, to the image classification model within the self-driving car, the modified input would no longer look like a stop sign. It would be misclassified, causing the vehicle to fail to stop and potentially injure pedestrians or passengers. Adversarial examples are still an active area of research, although various methods exist for addressing these vulnerabilities to some extent.
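The fast gradient sign method (FGSM) is a classic way to craft such a perturbation; the sketch below assumes a differentiable PyTorch classifier and uses placeholder variable names.

```python
import torch

def fgsm_example(model, images, labels, epsilon=0.01):
    """Return adversarially perturbed copies of `images` (shape N x C x H x W)."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    # Step in the direction that most increases the loss, then clip to valid pixels.
    perturbed = images + epsilon * images.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```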
Evasion Attacks
In an evasion attack, malicious users craft examples designed to fool or evade detection by a machine learning system. For example, they can evade a text-based spam email classifier by attaching an image to the email and embedding the spam text within the image.
Data Poisoning
Data poisoning is the training-time counterpart to adversarial example attacks. In data poisoning, an adversarial image (or, more generally, an adversarial example) is presented to the model at training time. The example looks innocuous to the human eye and carries a seemingly normal label, but it is crafted so that the model embeds it within the distribution of a different class. An attacker can thereby cause the system to exhibit aberrant behavior at test time merely by making their data publicly available, with the model's owner none the wiser.
Model Inversion
Model inversion attacks aim to uncover the private data a model was trained on by inverting the function the model has learned. Previously, such attacks were only viable against simple linear models whose inverses could easily be computed. However, recent research has shown that model inversion attacks are also possible against deep neural networks. This is a major concern, especially because many developers use publicly available models for transfer learning; since the model architecture is then known a priori to attackers, it is easier for them to invert the model and gain access to private data.
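As a rough sketch of the gradient-based flavor of this attack, an adversary with access to a model can optimize an input so the model assigns it high confidence for a target class, hoping to recover a representative training example. The input shape, step count, and learning rate below are placeholders.

```python
import torch

def invert_class(model, target_class, input_shape=(1, 3, 64, 64), steps=200, lr=0.1):
    """Gradient ascent on the input to maximize the target class score."""
    x = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        loss = -logits[0, target_class]   # maximize the target class logit
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)           # keep the reconstruction a valid image
    return x.detach()
```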
Conclusion
Many companies are only now beginning to realize the potential of machine learning. As they do, they will quickly learn that their algorithms can be as valuable as the data they house (in some cases, potentially more so). Great care must therefore be taken to secure all of the inputs, models, and outputs of their organizations, to the benefit of customers and investors alike.