A long-standing challenge in analyzing information leaks within
mobile apps is to automatically identify the code operating on
sensitive data. Because all existing solutions rely on System APIs
(e.g., those returning the IMEI or GPS location) or on features of
user interfaces (UI), content from app servers, such as a user’s
Facebook profile or payment history, falls through the cracks.

In this talk, I will introduce ClueFinder, a novel semantics-driven
solution for automatic discovery of sensitive user data, including
data that originates on the server side. ClueFinder utilizes natural
language processing (NLP) to automatically locate the program
elements (variables, methods, etc.) of interest, and then performs a
learning-based program structure analysis to accurately identify
those that actually carry sensitive content. Using this new technique,
we analyzed over 400k popular apps, an unprecedented scale for this
type of research. Our findings bring to light the pervasiveness of
information leaks, and the channels through which the leaks happen,
including unintentional over-sharing across libraries and
aggressive data acquisition behaviors.
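
To give a rough feel for the first stage described above (locating program elements whose names hint at sensitive content), here is a minimal sketch. It is not ClueFinder's implementation: the keyword list, the function names, and the plain token matching are all assumptions made for illustration, standing in for the NLP analysis, and the learning-based structure analysis is not modeled at all.

```python
import re

# Hypothetical keyword list chosen for illustration; not taken from ClueFinder.
SENSITIVE_TERMS = {"password", "email", "phone", "address",
                   "payment", "profile", "location"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (e.g. "HTTP"), capitalized or lowercase words, digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]


def find_candidates(identifiers):
    """Return the identifiers with at least one token in SENSITIVE_TERMS."""
    return [name for name in identifiers
            if SENSITIVE_TERMS & set(split_identifier(name))]


if __name__ == "__main__":
    names = ["getUserEmail", "payment_history", "btnOK", "renderFrame"]
    print(find_candidates(names))
```

A name match like this only yields candidates. Deciding whether a candidate really carries sensitive content is the job of the second, learning-based stage, which this sketch leaves out.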