Know the Security and Privacy Implications of Conversation Data Retention

Recently, we talked about security considerations when generating and transporting conversation data (or any data, for that matter), from the application level down to the storage platform where the data lives. But what are the security and privacy concerns with this data at rest? What are the implications of hanging onto it? These questions all come down to data retention: the implications of storing, archiving, and retrieving conversation data.

To help answer these questions and go deeper with some real-world examples, we’ll cover data retention through the lens of security and privacy in this post. This includes the various sources of conversation data, the reasons why you need to store these conversations in the first place, why it’s critical to have a conversation data retention policy, and finally, the latest best practices for conversation data retention.

Today’s Conversation Data Sources

Conversation data can be collected in many forms because people communicate primarily through language, whether spoken (over voice or video) or written. Communication also happens visually, as with sign language, and through touch, as with braille.

For this post, we’ll focus on written and spoken communication because it covers the majority of use cases related to conversation data retention, but that isn’t to say the other forms aren’t important. In fact, it’s incredibly fascinating to witness all the work being done to capture these other forms of language and make them digital, for example, using image recognition to convert braille to text.

For the spoken word, your input sources include video conferencing applications and CPaaS platforms (e.g., Zoom, Agora, Vonage), more traditional forms like telephony via SIP, WebRTC, or PSTN, and recorded audio/video, such as sales meetings, contact center interactions, or media like a podcast, Instagram video, or YouTube video. We often think of these first because they are the mediums we consume most frequently, and in overwhelming amounts, today. We also cannot forget about written communications, including emails and Slack messages, and social media interactions taking place on platforms such as Twitter, Facebook, and LinkedIn.

All of these conversations in all these formats are being stored today, and it’s essential to understand what these mediums are used for and the content of the conversations. That brings us to our next topic…

Motivations for Storing Conversations

Now, the requirements for storing conversation data can vary based on your use case, business or industry, but generally speaking, the most common reasons for maintaining persistent storage of this data include:

Historical Context: Enables you to aggregate conversations over time to uncover patterns occurring within your business and then act on them.
Analytics: Different from observable patterns, conversations can yield quantitative data and insights, such as how often things occur and when.
Data Labeling: Obtains metadata about your conversations and provides context about what is being said. These are things like referenceable entities, questions, action items, and other insights into your conversations.
Reinforcement Learning: Provides business domain-specific insights by feeding conversation data into your own AI/ML workflows.
Taking Action: Enables real-time identification of actions to take as a result of conversations, including automated follow-ups and next steps that move a conversation forward, such as in a sales engagement or process.

Use Case: Healthcare

The healthcare and medical industries are under intense, ongoing pressure to protect patient data and privacy, and rightly so. This sector sits at the intersection of most, if not all, of today’s most significant privacy concerns. That includes protecting patient medical records and history, but also other areas, such as card payments (look into the Payment Card Industry Data Security Standard [PCI DSS]) and third-party records and references from insurance companies. Adhering to regulations that govern personally identifiable information (PII), such as HIPAA, GDPR, and the UK Data Protection Act, is the first line of defense for keeping our records and identity from becoming public. But there’s more to do…

The first and most obvious step in keeping your data private is keeping the records generated by the medical institution itself private. That covers everything from new patient intake forms to medical charts to prescribed medications. The reason is that a lot can be inferred when bad actors gain access to even a few medical records. For example, if John Doe is currently taking some form of penicillin, you could infer that John has some form of bacterial infection. The strength of that antibiotic could then hint at the severity of the infection, and so on.

Keeping these records private becomes an ongoing challenge because medical institutions bound by HIPAA, for example, need to keep medical records for at least 6 years from the date of creation or the date they were last in effect, whichever is later. Maintaining those records’ confidentiality takes discipline over a great deal of time. The good news: Symbl.ai provides out-of-the-box capabilities that can help, such as redaction to hide personally identifiable information and protect patient privacy.
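To make the idea of redaction concrete, here is a minimal sketch in Python. It is not the Symbl.ai API; the PII_PATTERNS table and the redact function are hypothetical names used only for illustration, and simple regular expressions like these catch only well-formatted identifiers.

```python
import re

# Hypothetical patterns for illustration only. Real PII detection needs far
# broader coverage (names, addresses, medical record numbers, and so on).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


transcript = "Call John at 555-123-4567 or john.doe@example.com, SSN 123-45-6789."
print(redact(transcript))
```

Note that the name “John” passes through untouched. Catching names and other free-form identifiers generally requires entity recognition rather than pattern matching.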

It’s also critical to understand conversation aggregation: connecting the dots of information to find relationships between seemingly unrelated things, as in the previous example of the relationship between John Doe’s medication and the severity of his illness. In this case, aggregation works against the patient, because someone can pull on one string of information to unravel more details about an individual. The good news is that if you are embracing enterprise conversation applications, this might be an easier issue to address, because to protect relationships between data, you first need to identify the connection and then redact or obfuscate it.

Use Case: Social Media

Now, let’s take a look at a use case that’s dramatically different in terms of the industry’s goals and audience: social media.

Social media is a fascinating medium. These days it seems like the objective is to get as many “likes” or “followers” as possible, but the reality is that influencers are trying to build a brand around themselves or their interests. That’s usually achieved by creating an enormous amount of content. Why? There are many reasons, but if we stick to monetary ones, it’s mainly to make money through affiliate sales, advertising, or paid content. Understanding motivations helps us understand the business, and these businesses are primarily focused on consumerism, at least for those who rely on social media for a substantial part of their income.

If we are talking about consumers, the audience is literally anyone and everyone, but since we’re talking about the interwebs, a good portion of that audience is younger, and therein lies the problem. In the United States, regulations like the Children’s Online Privacy Protection Act (COPPA) play an interesting role in what data can be collected and, more importantly, stored relating to children under 13. In those cases, you want the time to live (TTL), or the point at which the data expires, to be effectively zero, thereby not collecting that information at all.

If we look beyond children under 13 to those who are 13 and older, the information that can be collected and stored from social media interactions can vary drastically depending on the domain or subject. Even if you aren’t subject to industry-specific regulatory compliance, there are usually some minimal requirements set by your local, state, or federal government. At a minimum, these tend to be the “better safe than sorry” self-imposed requirements. In the United States, this could be something like the Federal Information Security Management Act (FISMA), which requires data retention for 3 years, for example.
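One way to picture these differing requirements is a lookup table of retention windows keyed by data classification. The sketch below is illustrative only: the RETENTION table, the classification keys, and the function names are assumptions, the durations simply mirror the figures mentioned in this post, and real values should come from your own legal and compliance teams.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention windows mirroring the figures mentioned in this post.
# Years are approximated as 365 days; confirm real requirements with counsel.
RETENTION = {
    "coppa": timedelta(0),             # children under 13: TTL of zero
    "fisma": timedelta(days=3 * 365),  # roughly 3 years
    "hipaa": timedelta(days=6 * 365),  # roughly 6 years
}


def may_store(classification: str) -> bool:
    """A zero TTL means the data should never be persisted at all."""
    return RETENTION[classification] > timedelta(0)


def retain_until(classification: str, created_at: datetime) -> datetime:
    """Return the moment the retention window for this record ends."""
    return created_at + RETENTION[classification]


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(may_store("coppa"))
print(retain_until("hipaa", created).date())
```

Keeping the windows in one table makes the policy easy to review and change without touching the code that enforces it.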

Let’s Talk About Retention Policies

Now that we have defined some examples of the types of conversation data to store, it’s important to create policies addressing those requirements. To do that, you need to look at the lifecycle of a single bit of information and the journey it takes. This is usually defined in the following stages:

Creation
Storage
Access
Archival
Deletion

The creation phase is pretty straightforward. We need information or input to do something within the business, so we ask the user for it. This minimally contains something representing a timestamp to record when this data was created.
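As a minimal sketch of what a record might carry at creation time, consider the following. The ConversationRecord class and its field names are hypothetical, not part of any particular product; the point is only that the timestamp is captured automatically when the record is created.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class ConversationRecord:
    """Hypothetical record shape; field names are illustrative only."""
    content: str
    classification: str
    # Generated at creation so every record has an ID and a creation time.
    record_id: str = field(default_factory=lambda: uuid4().hex)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


record = ConversationRecord("Hi, I'd like to book a demo.", "public")
print(record.record_id, record.created_at.isoformat())
```

Storing the timestamp in UTC avoids ambiguity later, when retention windows are calculated across time zones.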

Once we’re done using that information, we store it, usually in a database whose contents eventually land on disk. Why? We need some form of persistent storage to keep all of this information so that it can be reused and recalled later. Check out this recent post on Data Persistence for Conversation Applications to learn more on this specific topic, including best practices when storing data of any kind.

If your applications and business require saving data, you need to be able to recall that information quickly. Entire categories of software (SQL, NoSQL, graph, and vector databases, among others) are devoted to storing and retrieving data, but the more relevant topic for this post is when your retention policy needs to delete this information. To do that, you need metadata recording when that information was last accessed, because many retention requirements are expressed as a period of time measured from creation or last access.
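The sketch below shows one way last-access metadata could drive deletion. It is an in-memory toy, not a real database: the ConversationStore class and its methods are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow and test.

```python
from datetime import datetime, timedelta, timezone


class ConversationStore:
    """Toy in-memory store that tracks when each record was last accessed."""

    def __init__(self):
        self._data = {}  # record_id -> [content, last_accessed_at]

    def put(self, record_id, content, now):
        self._data[record_id] = [content, now]

    def get(self, record_id, now):
        entry = self._data[record_id]
        entry[1] = now  # every read refreshes the last-access timestamp
        return entry[0]

    def purge_stale(self, max_idle, now):
        """Delete records idle longer than max_idle; return their IDs."""
        stale = [
            rid for rid, (_, last) in self._data.items()
            if now - last > max_idle
        ]
        for rid in stale:
            del self._data[rid]
        return stale


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
store = ConversationStore()
store.put("a", "first call", t0)
store.put("b", "second call", t0)
store.get("b", t0 + timedelta(days=300))  # "b" is read later, "a" never is
print(store.purge_stale(timedelta(days=365), t0 + timedelta(days=400)))
```

In a real system the same idea usually appears as a last-accessed column or a native TTL feature of the database, with the purge running as a scheduled job.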

If older data is not particularly relevant to your business’s day-to-day operations, you might have a process that places this data into archival storage. Why? There are two main reasons: 1) you need to hang onto this data until a specific date to remain in compliance, so you can’t just delete it, and 2) primary storage is expensive. Suppose you have the budget and a reason for keeping this information in primary storage because you know it will be accessed and useful at some point. In that case, you need a strategy for expanding physical storage capacity. Otherwise, you can migrate the data into an archival system, where you no longer intend to access it unless there is some external motivation to do so, such as providing evidence in a court case.
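The choice between primary storage, archival storage, and deletion can be sketched as a small decision function. Everything here is an assumption made for illustration: the storage_tier name, the tier labels, and the 90-day default are hypothetical and should be tuned to your own cost and compliance constraints.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(last_accessed_at, retain_until, now,
                 archive_after=timedelta(days=90)):
    """Decide where a record belongs based on its age and activity.

    The 90-day archive threshold is an arbitrary illustrative default.
    """
    if now >= retain_until:
        return "delete"   # the retention window has ended
    if now - last_accessed_at >= archive_after:
        return "archive"  # still must be kept, but nobody is reading it
    return "primary"      # actively used data stays on fast storage


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=120), now + timedelta(days=365), now))
```

Running a check like this on a schedule keeps expensive primary storage reserved for data that is actually being used.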

The final stage in the journey of data is deletion. There may be cases where you don’t need to archive data because it’s deleted immediately (e.g., data covered by COPPA). But if you store any conversation data, you should have a strategy to remove it when needed. “Removal” is a tricky word. It could mean leaving the data in the archives without performing any maintenance on it, dropping the references to it, or anonymizing it so you can no longer tell who it belongs to. In some cases, deletion means actual deletion because of regulatory compliance, and at other times… well, as the saying goes, if it ends up on the internet, it lives there forever.
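To illustrate how different these meanings of “removal” are in practice, here is a small sketch contrasting anonymization with a hard delete. The IDENTIFYING_FIELDS set, the record layout, and both function names are hypothetical and exist only for this example.

```python
# Hypothetical set of fields that tie a record to a person.
IDENTIFYING_FIELDS = {"speaker_name", "email", "phone"}


def anonymize(record: dict) -> dict:
    """Return a copy of the record with identifying fields dropped."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def hard_delete(store: dict, record_id: str) -> bool:
    """Remove the record entirely; report whether anything was deleted."""
    return store.pop(record_id, None) is not None


store = {
    "r1": {
        "speaker_name": "John Doe",
        "email": "john.doe@example.com",
        "transcript": "I would like to refill my prescription.",
    }
}
print(anonymize(store["r1"]))
print(hard_delete(store, "r1"), store)
```

Dropping fields is only a first step: the transcript text itself may still contain identifying details, which is where redaction comes back into the picture.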

Key Takeaways

In closing, here is the TLDR on conversation data retention policies:

Know what regulatory bodies and laws apply to your business.
EVERYTHING is collected these days, so know the reasons why.
Know what classification your data falls into, including public, HIPAA protected, COPPA protected, etc.
Remember that EVERYTHING is stored in some form today and you need to know where your data is stored at all times.
Data lives on disk in different forms due to regulations, security, and even cost, so know what types of storage your conversation data lives on.
Eventually, data is deleted. Be sure you know when that happens and for which datasets.
Know what “deleted” truly means for your organization.

If you know the answers to these questions for your business, that’s a great starting point. If the process is documented, you’re halfway there. If there is a design specification and a way to visualize all of these buckets of data down to the leaf node of individual data, that is truly spectacular!

Developing a conversation data retention policy for your business is instrumental in avoiding accusations of misappropriating data, in reducing risk when systems are compromised in data breaches, and even in helping your bottom line by storing your data in the most cost-effective way for your retention requirements. Usually, you only hear of failures in managing data retention policies when something catastrophic happens and it ends up on the evening news or the front pages, in the form of personal data breaches or data loss because someone accidentally deleted something.

So, be proactive with your conversation data retention policy because it could help you avoid that 3 a.m. phone call from production support!

If you’d like to learn more, reach out to us today!

David vonThenen

Developer Advocate

David is a self-described Tech geek and Developer Advocate enabling others to process communications to derive conversation intelligence. David talks Kubernetes/containers, VMware virtualization, backup recovery/replication solutions, adaptors in hardware storage connectivity, and everything else that is tech!