50.054 - Advanced Topics in Static Analysis¶
Learning Outcomes¶
- Apply Path Sensitive Analysis to Sign Analysis.
- Apply Static Analysis to detect software security loopholes.
Recall that sign analysis¶
The Sign Analysis that we developed in the previous class has some limitation.
graph
A["⊤"]---B[-]
A---C[0]
A---D[+]
B---E
C---E
D---E[⊥]
And the abstract value operators are defined as
-- | \(\top\) | + | - | 0 | \(\bot\) |
---|---|---|---|---|---|
\(\top\) | \(\top\) | \(\top\) | \(\top\) | \(\top\) | \(\bot\) |
+ | \(\top\) | \(\top\) | + | + | \(\bot\) |
- | \(\top\) | - | \(\top\) | - | \(\bot\) |
0 | \(\top\) | - | + | 0 | \(\bot\) |
\(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) |
>== | \(\top\) | + | - | 0 | \(\bot\) |
---|---|---|---|---|---|
\(\top\) | \(\top\) | \(\top\) | \(\top\) | \(\top\) | \(\bot\) |
+ | \(\top\) | \(\top\) | \(\top\) | + | \(\bot\) |
- | \(\top\) | 0 | \(\top\) | 0 | \(\bot\) |
0 | \(\top\) | 0 | + | + | \(\bot\) |
\(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) |
By converting the equation system into monotonic function
when we apply the fixed point algorithm to the f1
and the VarSign lattice, we have the following solution
At label 6, x
's sign is \(\top\). Such a problem exists in general as static analyses are approximation. Some point in the above analysis causes the result being overly approximated.
- Could it be due to the problem of how the abstract operators
--
and>==
are defined?- No, they are as best as we could infer given the variables (operands) are not assigned with concrete values.
-
Could it be due to the lattice having too few elements (abstract values)? No, it remains as top, even if we introduce new elements such as
+0
and-0
- Let's say we adjust the lattice
* and the abstract operatorsgraph A["⊤"]---A1[+0] A["⊤"]---A2[-0] A2---B[-] A2---C[0] A1---C[0] A1---D[+] B---E C---E D---E[⊥]
-- \(\top\) +0 -0 + - 0 \(\bot\) \(\top\) \(\top\) \(\top\) \(\top\) \(\top\) \(\top\) \(\top\) \(\bot\) +0 \(\top\) \(\top\) +0 \(\top\) + +0 \(\bot\) -0 \(\top\) -0 \(\top\) \(\top\) + -0 \(\bot\) + \(\top\) \(\top\) + \(\top\) + + \(\bot\) - \(\top\) - \(\top\) - \(\top\) - \(\bot\) 0 \(\top\) -0 +0 - + 0 \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) >== \(\top\) +0 -0 + - 0 \(\bot\) \(\top\) +0 +0 +0 +0 +0 +0 \(\bot\) +0 +0 +0 + +0 + +0 \(\bot\) -0 +0 +0 +0 0 +0 +0 \(\bot\) + +0 +0 + +0 + + \(\bot\) - +0 0 +0 0 +0 0 \(\bot\) 0 +0 +0 + 0 + + \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) \(\bot\) - It does not help, as it might give
t
a more precise abtract value but it does not help to improve the result ofx
The real cause of the loss of precision is path insensitivity of the sign analysis, i.e. it does not exploit the fact that the path of going in the while loop body is only valid under the pre-condition x>=0
and the path of going out of the while loop is only valid under the condition x < 0
.
Path sensitive analysis via assertion¶
Supposed in the source language level, i.e SIMP, we include the assertion statement. For example, we consider the source program of PA1
in SIMP with assertion statements inserted in the body of the while loop and at the following statement of the while loop.
As we translate the above SIMP program in to Pseudo Assembly, we retain the assertions as instructions
We could add the following monotonic function synthesis case
- case \(l: assert\ t\ >=\ src\), \(s_l = join(s_l)[ t \mapsto gte(join(s_l)(t), join(s_l)(src))]\)
- case \(l: assert\ t\ <\ src\), \(s_l = join(s_l)[ t \mapsto lt(join(s_l)(t), join(s_l)(src))]\)
Where \(gte\) and \(lt\) are defined specifically for assertion instructions. The idea is to exploit the comparison operators to "narrow" down the range of the abstract signs.
gte | \(\top\) | +0 | -0 | + | - | 0 | \(\bot\) |
---|---|---|---|---|---|---|---|
\(\top\) | \(\top\) | +0 | \(\top\) | + | \(\top\) | +0 | \(\bot\) |
+0 | +0 | +0 | +0 | + | +0 | +0 | \(\bot\) |
-0 | -0 | 0 | -0 | \(\bot\) | 0 | -0 | \(\bot\) |
+ | + | + | + | + | + | + | \(\bot\) |
- | - | \(\bot\) | - | \(\bot\) | - | \(\bot\) | \(\bot\) |
0 | 0 | 0 | 0 | \(\bot\) | 0 | 0 | \(\bot\) |
\(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) |
lt | \(\top\) | +0 | -0 | + | - | 0 | \(\bot\) |
---|---|---|---|---|---|---|---|
\(\top\) | \(\top\) | \(\top\) | - | \(\top\) | - | - | \(\bot\) |
+0 | +0 | +0 | \(\bot\) | +0 | \(\bot\) | \(\bot\) | \(\bot\) |
-0 | -0 | 0 | -0 | -0 | - | - | \(\bot\) |
+ | + | + | \(\bot\) | + | \(\bot\) | \(\bot\) | \(\bot\) |
- | - | - | - | - | - | - | \(\bot\) |
0 | 0 | 0 | \(\bot\) | 0 | \(\bot\) | \(\bot\) | \(\bot\) |
\(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) | \(\bot\) |
To show that the above definitions of gte
and lt
are sound. We can consider the range notation of the abstract values.
\([l, h]\) denotes the set of integers \(i\) where \(l \leq i \leq h\). \(\bot\) is an empty range.
We can think of gte
as
Similiarly we can think of lt
as
where \(+\infty - 1 = +\infty\)
With the adjusted monotonic equations, we can now define the monotonic function f2
as follows
By applying the fixed point algorithm to f2
we find the following solution
which detects that the sign of x
at instruction 8 is -
.
Information Flow Analysis¶
One widely applicable static analysis is information flow analysis.
The information flow in a program describes how data are evaluated and propogated in the program via variables and operations.
The goal of information flow analysis is to identify "incorrect information flow". There two main kinds.
- Low security level data is being written into high security level resources, AKA tainted flow, e.g. SQL injection.
- High security level data is being sent to low security level observer. i.e, sensitive resource being read by unauthorized users, AKA, information disclosure.
Tainted Flow¶
IN this case, we say that the information flow is tainted if some sensitive information is updated / written by unauthorized users, e.g.
The above example was adapted from some online example showing what SQL injection vulnerable python code looks like. In this case we could argue that it is a tainted control flow as the untrusted user data is being used directly to access the sensitive resources.
When the id
is "' OR 'a'='a'; delete from customer_data; --"
, the malicious user gains the login access and deletes all records from the customer_data
table.
This can be prevented by using a prepared statement.
One may argue that using manual source code review should help to identify this kind of issues. The situation gets complicated when the program control is not simple
Let's recast the above into SIMP, we would have the vulunerable code as
We assume that we've extended SIMP to support string values and string concatenation. Theinput
is a function that prompts the user for input. The exec
function is a database builtin function.
The following version fixed the vulnerability, assume the sanitize
function, sanitizes the input.
To increase the level of compexlity, let's add some control flow to the example.
In the above, it is not directly clear that the exec()
is given a sanitized query. The manual check becomes exponentially hard as the code size grows.
We can solve it using a forward may analysis. We define the abstract domain as the following complete lattice.
graph
tainted --- clean --- bot("⊥")
We rewrite the above SIMP program into the following PA equivalent.
We define the equation generation rules as follows,
where \(pred(s_i)\) returns the set of predecessors of \(s_i\) according to the control flow graph.
The monotonic functions can be defined by the following cases.
- case \(l == 0\), \(s_0 = \lbrack x \mapsto \bot \mid x \in V\rbrack\)
- case \(l: t \leftarrow src\), \(s_l = join(s_l) \lbrack t \mapsto join(s_l)(src)\rbrack\)
- case \(l: t \leftarrow src_1\ op\ src_2\), \(s_l = join(s_l) \lbrack t \mapsto (join(s_l)(src_1)\ \sqcup \ join(s_l)(src_2))\rbrack\)
- case \(l: t \leftarrow input()\): \(s_l = join(s_l) \lbrack t \mapsto tainted\rbrack\)
- case \(l: t \leftarrow sanitize(src_1, src_2)\): \(s_l = join(s_l) \lbrack t \mapsto clean\rbrack\)
- other cases: \(s_l = join(s_l)\)
Let \(m\) be a map lattice object, and \(src\) be a PA source operand, the lookup operation \(m(src)\) for sign analysis is defined as follows
We inline the equations as comments in the PA code.
By applying the above we have the followning monotonic function.
By applying the fixed point algorithm, we find the following solution for the monotonic equations.
query
at instruction 8 is risky as query
may be tainted at this point.
Sensitive Information Disclosure¶
In this case of incorrect information flow, sensitive information may be written / observable by low access level agents, users or system, directly or indirectly.
In the above code snippet, we retrieve the username and password from a HTTP form request object and create a user record in the database table. The issue with this piece of code is that the user password is inserted into the database without hashing. This violates the security policy of which the user password is confidential where the database table user
is having a lower security level, since it is accessed by the database users.
In the above modified version, we store the hashed password instead, which is safe since it is hard to recover the raw password based on the hashed password. The hash
function here serves as a declassifier that takes a high security level input and returns a lower secirity level output.
We can analyses this kind of information flow that will accept the second snippet and reject the first one. The idea is nearly identical to the tainted analysis, except,
- We will have a different lattice for security level.
graph secret --- confidential --- bot("⊥")
where secret
has a higher level of security than confidential
.
- Instead of using
sanitize
to increase the level of security, we usehash
(ordeclassify
) to lower the level of security.