Monday, November 18, 2013

Binaries and Process Tracing

A little bit about Linux programs

The Linux ABI (Application Binary Interface) defines how an executable is bound to its imported functions at runtime. The actual binding is performed by the dynamic linker (ld.so), with support routines in libc's sysdeps. For example, when a programmer writes code that contains a call to "printf", the dynamic linker is responsible for resolving a pointer (in the form of a memory address) inside libc.so, then writing it into the executable's import tables (the GOT and PLT) so that the function can be called from the executable directly. The dynamic linker is specified per-executable as the "program interpreter," which also allows customized executable formats to supply their own. All dynamically linked Linux applications carry what is called an INTERP header (or .interp section); you can see this using the command line utility readelf, like so:

user@host $ grep interpreter <(readelf -a $(which ls))
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

Because I am using a 64-bit system for this demonstration, all of the dynamically linked binaries in my testing environment specify /lib64/ld-linux-x86-64.so.2; on a 32-bit system, executables will specify a 32-bit counterpart.
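
To make the mechanics concrete, here is a minimal C sketch of what readelf is showing us (the file name in the compile comment is a placeholder): it reads the ELF header, walks the program headers, and prints the PT_INTERP segment.

#include <elf.h>
#include <stdio.h>

/* Compile: gcc show_interp.c -o show_interp (hypothetical name).
 * Minimal error handling; assumes a 64-bit ELF on disk. */
int main(int argc, char **argv) {
    FILE *f;
    Elf64_Ehdr ehdr;
    Elf64_Phdr phdr;
    int i;

    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 1;

    fread(&ehdr, sizeof(ehdr), 1, f);   /* ELF file header */

    for (i = 0; i < ehdr.e_phnum; i++) {
        fseek(f, ehdr.e_phoff + i * sizeof(phdr), SEEK_SET);
        fread(&phdr, sizeof(phdr), 1, f);
        if (phdr.p_type == PT_INTERP) { /* the .interp segment */
            char interp[128] = {0};
            fseek(f, phdr.p_offset, SEEK_SET);
            fread(interp, 1, sizeof(interp) - 1, f);
            printf("Requesting program interpreter: %s\n", interp);
        }
    }
    fclose(f);
    return 0;
}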

A little about process tracing

Process tracing in a Linux environment can be performed using several different debugging tools, namely strace, ltrace, ftrace, and interactive debuggers (such as gdb). While strace is an excellent tool for monitoring I/O and certain system calls, it falls short when it comes to monitoring calls into shared objects. That is why ltrace and ftrace were born: they show the actual function calls as they occur from a process into the shared objects (*.so files) imported by the executable. This lets administrators and programmers debugging an application determine where in its calls to shared objects things begin to go wrong. Process tracing and debuggers can also be helpful for malware analysis and detection. As such, attackers frequently target these utilities in search of evasion methodology, among other bugs (debugger exploits, anyone?).

Self-linking code

When I wrote the dynamic engine for shellcodecs, I implemented my own version of program interpretation. Why? Because there is no guarantee that a given executable will have the functions shellcode needs in its import table. So I wrote a piece of assembly capable of parsing an ELF64 shared object to isolate pointers to the functions I wanted to call, similar to dlsym() from libdl. Recently, while entertaining the idea of writing an all-assembly rootkit, I looked into how calls made by the shellcodecs engine are handled by different tracing methods. I put together a couple of programs to see what information tracing the processes actually revealed, and got some pretty interesting results.

Test programs and results

My test programs were relatively simple. Here is a normal piece of C code that prints "ohai" and then calls exit(2):

#include <stdio.h>
#include <dlfcn.h>
#include <stdlib.h>

// Compile: gcc ltrace-test.c -o ltrace-test

int main(void) {
    printf("ohai");
    exit(2);
}

And its ltrace output:

user@host $ ltrace ./ltrace-test
__libc_start_main(0x400544, 1, 0x7fff70b45d88, 0x400570, 0x400600 
printf("ohai")                                                                               = 4
exit(2ohai 
+++ exited (status 2) +++

Notice the tracer caught the call to printf as well as the call to exit: it shows both exit(2) and "exited (status 2)". This is an important distinction for our next test:

#include <stdio.h>
#include <dlfcn.h>
#include <stdlib.h>

// Compile: gcc ltraced.c -o ltraced -ldl

int main(void)
{
    void *libc;
    int (*putstr)(char *);
    int (*exitp)(int);
    libc = dlopen("/lib/i386-linux-gnu/i686/cmov/libc.so.6",RTLD_LAZY);
    *(void **)&putstr = dlsym(libc,"puts");
    *(void **)&exitp  = dlsym(libc,"exit");
    putstr("ohai");
    exitp(2);
}

And its ltrace results:

user@host $ ltrace ./ltraced
__libc_start_main(0x400594, 1, 0x7fff36ae94b8, 0x400610, 0x4006a0 
dlopen("/lib/i386-linux-gnu/i686/cmov/li"..., ) = NULL
dlsym(NULL,"puts")                              = 0x7f400a7e0ce0
dlsym(NULL,"exit")                              = 0x7f400a7ab970
ohai
+++ exited (status 2) +++

Notice that this time ltrace only caught the calls to dlopen() and dlsym(); it didn't catch the calls to puts() or exit() themselves. The reason: puts() and exit() never appear in the binary's import table, as you can see with the following:

user@host $ objdump -R ./ltraced

./ltraced:     file format elf64-x86-64

DYNAMIC RELOCATION RECORDS
OFFSET           TYPE              VALUE 
0000000000600fe0 R_X86_64_GLOB_DAT  __gmon_start__
0000000000601000 R_X86_64_JUMP_SLOT  __libc_start_main
0000000000601008 R_X86_64_JUMP_SLOT  dlopen
0000000000601010 R_X86_64_JUMP_SLOT  dlsym
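
One more detail worth noting in the trace above: dlopen() actually returned NULL (that hard-coded 32-bit path doesn't exist on my 64-bit test box), yet dlsym() still resolved both symbols. That's because glibc defines RTLD_DEFAULT as ((void *) 0), so a NULL handle simply searches the global symbol scope. A minimal demonstration (a sketch, assuming glibc; the file name is a placeholder):

#define _GNU_SOURCE   /* for RTLD_DEFAULT */
#include <dlfcn.h>

/* Compile: gcc rtld_default.c -o rtld_default -ldl (hypothetical name).
 * Shows dlsym() resolving a symbol with no successful dlopen() at all. */
int main(void) {
    int (*putstr)(const char *);
    *(void **)&putstr = dlsym(RTLD_DEFAULT, "puts");
    if (putstr)
        putstr("resolved puts without dlopen");
    return 0;
}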

Implications and further testing

Once I realized ltrace is only capable of tracing functions in the executable's import table, I wondered whether it was possible to completely evade ltrace for called functions with an assembly application. The results were phenomenal.

user@host $ ltrace ./full_import_test 
__libc_start_main(0x400554, 1, 0x7fff92666938, 0x400690, 0x400720Successfully called puts without import
 
+++ exited (status 2) +++

I was able to get these results with the following assembly program:

.global main
.section .data
.section .bss

# MUST BE COMPILED:
# gcc full_import_test.s -ldl -Wl,-z,relro,-z,now -o full_import_test
libc_base:
    .align 8 
libdl_base:
    .align 8

.section .text

main:
  xor %rdi, %rdi
  mov $0x400130, %rbx
  mov (%rbx), %rcx
  add 0x10(%rbx), %rcx
  mov 0x20(%rcx, %rdi, 2), %rbx     # grab pointer to dlclose()

find_base:
  dec %rbx
  cmpl $0x464c457f, (%rbx)          # scan back for "\x7fELF": base of libdl
jne find_base

save_libdl:
  mov $libdl_base, %rdi
  mov %rbx, (%rdi)
  xor %rdi, %rdi

dlopen_libc:
  push $0x25764b07       # Function hash for dlopen()
  pop %rbp              

  mov $libc, %rdi        # libc.so.6

  push $0x01             
  pop %rsi               # RTLD_LAZY
  call invoke_function   # (%rax) = dlopen('libc.so.6',RTLD_LAZY);

save_libc:
  mov (%rax), %rcx
  mov $libc_base, %rax
  mov %rcx, (%rax)
 
jmp _world

 
################
#
#  Takes a function hash in %rbp and base pointer in %rbx
#  >Parses the dynamic section headers of the ELF64 image
#  >Uses ROP to invoke the function on the way back to the
#  -normal return location
#
#  Returns results of function to invoke.
#
invoke_function:
  push %rbp
  push %rbp
  push %rdx
  xor %rdx, %rdx
  push %rdi
  push %rax
  push %rbx      
  push %rsi
  push %rbp
  pop %rdi
 
  read_dynamic_section:
    push %rbx
    pop %rbp                    # %rbp = lib base

    push $0x4c
    pop %rax
    add (%rbx, %rax, 4), %rbx   # %rbx += *(base + 0x130): .dynamic on this build
 
  check_dynamic_type:
    add $0x10, %rbx           # next Elf64_Dyn entry (0x10 bytes each)
    cmpb $0x5, (%rbx)         # looking for DT_STRTAB (tag 5)
  jne check_dynamic_type
 
  string_table_found:
    mov 0x8(%rbx), %rax       # %rax is now location of dynamic string table
    mov 0x18(%rbx), %rbx      # %rbx is now a pointer to the symbol table.
 
  check_next_hash:
    add $0x18, %rbx           # next Elf64_Sym entry (0x18 bytes each)
    push %rdx
    pop %rsi
    xorw (%rbx), %si          # %si = st_name (string table offset)
    add %rax, %rsi            # %rsi = pointer to the symbol's name
 
    calc_hash:
      push %rax
      push %rdx
 
      initialize_regs:
        push %rdx
        pop %rax
        cld
 
        calc_hash_loop:
          lodsb                   # next byte of the symbol name
          rol $0xc, %edx
          add %eax, %edx          # hash = rol(hash, 12) + byte
          test %al, %al
          jnz calc_hash_loop      # NUL terminator ends the loop
 
      calc_done:
        push %rdx
        pop %rsi
 
      pop %rdx 
      pop %rax
 
  cmp %esi, %edi
 
  jne check_next_hash
 
  found_hash:
    add 0x8(%rbx,%rdx,4), %rbp    # %rbp = lib base + st_value
    mov %rbp, 0x30(%rsp)          # overwrite the saved hash with the pointer
    pop %rsi
    pop %rbx
    pop %rax
    pop %rdi
    pop %rdx
    pop %rbp
ret                               # "returns" into the resolved function

# push hashes_array_index
# call fast_invoke
fast_invoke:
  push %rbp
  push %rbx
  push %rcx

  mov 0x20(%rsp), %ecx

  mov $libc_base, %rax
  mov (%rax), %rbx

  mov $hashes, %rax
  mov (%rax, %rcx, 4), %ebp

  # Registers required for link to work:
  # rbp - function hash
  # rbx - base pointer to lib
  call invoke_function

  mov 0x18(%rsp), %rcx # grab retptr
  mov %rcx, 0x20(%rsp) # kill the function argument
  pop %rcx
  pop %rbx
  pop %rbp
  add $0x8, %rsp
  ret

# freed developer registers: 
# rax rbp rbx rcx r11 r12 r13 r14 r15
#
# a libc call:
# function(%rdi,  %rsi,  %rdx,  %r10,  %r8,  %r9)
_world:
  mov $hiddenmsg, %rdi  # arg1
  push $0x1             # function array index in hashes label for puts()
  call fast_invoke      # puts("Successfully called puts without import")


  push $0x02            #
  pop %rdi              # arg1

  push $0x00            # array index in hashes label for exit()
  call fast_invoke      # exit(2);

  ret                   # Exit normally from libc, exit(0)
  #  after execution, echo $? shows 2 and not 0 ;)

force_import:
    call dlclose

libc: 
    .asciz "libc.so.6"

hashes:
    .long 0x696c4780, 0x74773750

hiddenmsg:
    .asciz "Successfully called puts without import"

And its import table does not contain dlopen(), puts(), or exit():

user@host $ objdump -R full_import_test

full_import_test:     file format elf64-x86-64

DYNAMIC RELOCATION RECORDS
OFFSET           TYPE              VALUE 
0000000000600ff8 R_X86_64_GLOB_DAT  __gmon_start__
0000000000600fe8 R_X86_64_JUMP_SLOT  __libc_start_main
0000000000600ff0 R_X86_64_JUMP_SLOT  dlclose

In this example, we call dlopen() on libc, then use the shellcodecs implementation of dlsym() to resolve the functions we call. The tricky bit was getting dlopen() itself to work without showing up in the ltrace output. For this, I ended up putting a "call dlclose" at the end of the application in the force_import label (it is never actually executed); that forces dlclose into the import table. By compiling with full RELRO (-z relro -z now), every GOT entry is resolved at load time, so I was able to use the pointer to dlclose() in the GOT as a way to pivot back to the base of libdl, then re-parse its export table to traverse back to dlopen(). As a result, none of the shared objects opened by dlopen() are noticed by ltrace or ftrace. Depending on your runtime environment and your compiler, the offset may be subject to change. The following line is responsible for extracting the dlclose() pointer:

  mov 0x20(%rcx, %rdi, 2), %rbx     # grab pointer to dlclose()

If for some reason this code isn't working on your system, you can probably achieve the desired result by changing the offset from 0x20 to either 0x18 or 0x28; it is a static offset fixed at compile time. We could also iterate over the string table to confirm we grabbed the right pointer (i.e., make sure we are really looking at dlclose()), but that was not the purpose of these tests. So when it comes to binaries like this, strace is (for now) the only non-interactive tracing option available, and it won't show you some of those shared-object calls that might be vital to your research.
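
For the curious, here is a sketch of that string-table check in C (assuming a non-PIE ELF64 binary like the test programs above; the file name is a placeholder): it walks the program's own _DYNAMIC array, finds the PLT relocations, and prints each GOT slot alongside the symbol name it binds to, which is exactly the information needed to confirm the dlclose() offset.

#include <elf.h>
#include <stdio.h>

/* Compile: gcc got_names.c -o got_names (hypothetical name).
 * Assumes a non-PIE ELF64 binary, so dynamic pointers are absolute. */

extern Elf64_Dyn _DYNAMIC[];    /* provided by the linker */

int main(void) {
    Elf64_Rela *rela   = NULL;  /* PLT relocation entries */
    Elf64_Sym  *symtab = NULL;  /* dynamic symbol table   */
    const char *strtab = NULL;  /* dynamic string table   */
    size_t relasz = 0;
    Elf64_Dyn *d;
    size_t i;

    for (d = _DYNAMIC; d->d_tag != DT_NULL; d++) {
        switch (d->d_tag) {
        case DT_JMPREL:   rela   = (Elf64_Rela *)d->d_un.d_ptr;  break;
        case DT_PLTRELSZ: relasz = d->d_un.d_val;                break;
        case DT_SYMTAB:   symtab = (Elf64_Sym *)d->d_un.d_ptr;   break;
        case DT_STRTAB:   strtab = (const char *)d->d_un.d_ptr;  break;
        }
    }

    /* Each PLT relocation names one imported function and its GOT slot. */
    for (i = 0; rela && i < relasz / sizeof(Elf64_Rela); i++) {
        Elf64_Sym *sym = &symtab[ELF64_R_SYM(rela[i].r_info)];
        printf("GOT slot %p -> %s\n",
               (void *)rela[i].r_offset, strtab + sym->st_name);
    }
    return 0;
}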


Saturday, November 9, 2013

Development notes from Beleth: Multi-threaded SSH Password Auditor

Introduction

Beleth is a fast multi-threaded SSH password auditing tool. For a quick introduction to the tool and how to use it, head over to Blackhat Library.

Get the source

Beleth is available on github and will continue to be updated with new features. If you'd like in on the development, submit a pull request.

$ git clone https://github.com/chokepoint/Beleth.git
$ cd beleth
$ make

Multi-threaded design

There are a couple of options available to developers designing multi-threaded software in C on Linux-based systems. Two of the most popular are fork() and pthread_create(). fork() differs from pthread_create() in that no address space is shared between the parent and the child; instead, the child process gets a complete copy of the parent's address space, code, and stack. In order to keep dependencies to a minimum, I decided to go with a standard fork() design.

pid = fork();
if (pid < 0) {
    fprintf(stderr, "[!] Couldn't fork!\n");
    destroy_pw_list();
    exit(1);
} else if (pid == 0)  { /* Child thread */
    crack_thread(t_current);
    if (ptr != NULL)
        free(ptr);
} else {               /* Parent thread */
    ...
}

This is great, but we need a way to control the child processes that are running through the password list.

Inter-process Communication (IPC)

There are many options available to developers when it comes to IPC as well. Below are some of them:

  • Shared Memory
  • FIFOs
  • Half-Duplex Pipes
  • Full-Duplex Pipes
  • Sockets

We are using fork(), so shared memory is not an immediate option unless we feel like mmap()ing a shared region for communication, and that can get messy. FIFOs and pipes would work for distributing the wordlist among threads, but in order to keep options open, Beleth uses Unix domain sockets for all IPC. By designing the IPC around sockets, it would be trivial to turn Beleth into a distributed cracking platform.

The task-handling process binds to a socket file:

int listen_sock(int backlog) {
    struct sockaddr_un addr;
    int fd, optval = 1;

    if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
        if (verbose >= VERBOSE_DEBUG)
            fprintf(stderr, "[!] Error setting up UNIX socket\n");
        return -1;
    }

    fcntl(fd, F_SETFL, O_NONBLOCK); /* Set socket to non-blocking */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(int));

    memset(&addr, 0x00, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_file, sizeof(addr.sun_path) - 1);

    unlink(sock_file);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        if (verbose >= VERBOSE_DEBUG)
            fprintf(stderr, "[!] Error binding to UNIX socket\n");
        return -1;
    }

    if (listen(fd, backlog) == -1) {
        if (verbose >= VERBOSE_DEBUG)
            fprintf(stderr, "[!] Error listening to UNIX socket\n");
        return -1;
    }

    return fd;
}

Each cracking thread establishes a connection to the socket file in order to request the next password in the list, and to tell the task handler when a correct password is found:

int connect_sock(void) {
    int fd;
    struct sockaddr_un addr;

    if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1) {
        if (verbose >= VERBOSE_DEBUG)
            fprintf(stderr, "[!] Error creating UNIX socket\n");
        return -1;
    }

    memset(&addr, 0x00, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_file, sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        if (verbose >= VERBOSE_DEBUG)
            fprintf(stderr, "[!] Error connecting to UNIX socket\n");
        return -1;
    }
    return fd;
}

The protocol is simple, based on the following definitions located in beleth.h:

/* IPC Protocol Header Information */
#define REQ_PW 0x01 /* Request new password to try */
#define FND_PW 0x02 /* Found password */
#define NO_PW  0x03 /* No PWs left... cleanup */
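
To illustrate the flow, here is a hypothetical sketch of a cracking thread's request loop. This is not the repo's exact wire format; I'm assuming a one-byte opcode optionally followed by the password text, and try_password() stands in for the SSH attempt:

#include <sys/types.h>
#include <unistd.h>

#define REQ_PW 0x01
#define FND_PW 0x02
#define NO_PW  0x03

/* 'fd' comes from connect_sock() above. The opcode-plus-password framing
 * here is an assumption for illustration. */
void crack_loop(int fd, int (*try_password)(const char *)) {
    char op, buf[256];
    ssize_t n;

    for (;;) {
        op = REQ_PW;
        if (write(fd, &op, 1) != 1)        /* ask the handler for work */
            return;
        n = read(fd, buf, sizeof(buf) - 1);
        if (n <= 0 || buf[0] == NO_PW)     /* list exhausted: clean up */
            return;
        buf[n] = '\0';
        if (try_password(buf + 1)) {       /* skip the opcode byte */
            op = FND_PW;
            write(fd, &op, 1);             /* report the hit */
            return;
        }
    }
}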

To-do list

  • Add option for user name list
  • Add option for host list
  • Add simple port scanner and feed new IPs to the task handler
  • Add distributed cracking support

Wednesday, November 6, 2013

Local testing for executable overhead

The other day a friend of mine and I were discussing the different types of overhead involved with different programming languages, and I used some simple comparisons to show that compiled languages have lower overhead than interpreted ones. While this does not directly measure RAM or processor usage (those vary with the developer's code), it can give you a general idea of the overall efficiency of a language's implementation. We'll be comparing the disk usage and running time of a simple program, exit(0), written in a variety of languages.

Assembly

This is a very basic implementation of exit() using a Linux system call.

.section .data
.section .text
.globl _start
_start:
    xor %rdi, %rdi      # exit status: 0
    push $0x3c
    popq %rax           # syscall number 60: sys_exit
    syscall

I saved the file as exit.s and assembled/linked it with the following commands:

$ as exit.s -o exit.o
$ ld exit.o -o exit

C

This is a very quick version of exit.c:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    return 0;
}

I compiled this using the following:

$ gcc exit.c -o exit-c

Perl

exit.pl is only two lines long:

#!/usr/bin/perl
exit(0);

I packed this into a standalone executable using PAR Packer (pp):

$ pp exit.pl -o exit-pl

Simple comparisons

Disk usage reveals:

$ du -sh exit exit-c exit-pl
4.0K exit
12K exit-c
2.4M exit-pl

That test includes slack space in its results, since du reports allocated blocks. Let's find out what the actual byte counts of these files are, shall we?

$ wc -c exit exit-c exit-pl
    664 exit
   8326 exit-c
2474525 exit-pl
2483515 total

A timing test will show us:

$ time ./exit

real 0m0.001s
user 0m0.000s
sys  0m0.000s

$ time ./exit-c

real 0m0.002s
user 0m0.000s
sys  0m0.000s

$ time ./exit-pl

real 0m0.187s
user 0m0.100s
sys  0m0.020s

Interpreters

While the perl example above was packed with PAR Packer, that might not be a fair comparison for a script. We can time the equivalent one-liners in perl, ruby, python, and php as they run under their interpreters:

$ time perl -e 'exit(0);'

real 0m0.005s
user 0m0.000s
sys 0m0.004s

$ time ruby -e 'exit'

real 0m0.008s
user 0m0.004s
sys 0m0.004s

$ time python -c 'exit'

real 0m0.024s
user 0m0.016s
sys 0m0.008s

$ time php -r 'exit(0);'

real 0m0.017s
user 0m0.008s
sys 0m0.008s

These timing tests can serve as a rough indicator of the general performance of a given language. With assembly in the lead and C not far behind, it's easy to see that truly compiled or assembled languages are in fact faster than interpreted ones. These aren't perfectly fair comparisons, though, for several reasons:

  • Unused compiler/interpreter functionality overhead is included regardless of whether or not we use it in our code
  • Other processes running on the test system may make wall-clock timing unreliable (real cycle counting, sketched after this list, is much more reliable)
  • Actual CPU/Ram usage was never measured
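
As an illustration of that cycle-counting aside, here is a minimal sketch using the x86 timestamp counter (the measured region is a placeholder; serious measurements would also pin the process to one core and serialize execution with cpuid):

#include <stdio.h>
#include <stdint.h>

/* Read the CPU's timestamp counter. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t start, end;

    start = rdtsc();
    /* ... code under test goes here ... */
    end = rdtsc();

    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}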

Imperfect as they are, these tests should give you some idea of the difference in overhead and performance between the given interpreters on the test system, and they certainly show that, in general, compiled and assembled languages run more quickly than interpreted ones. Of course, the performance of any application owes partly to how it is programmed; so while this may give you an idea of a language's performance, it won't tell you how well any particular application written in that language is going to run.

Saturday, November 2, 2013

PHP Database Programming: Introducing 'ormclass'

To begin with, I have a few problems with the traditional web stack. Suppose I wanted to write a feature-rich, user-friendly web application; this requires that I know at least five programming languages: HTML, CSS, JavaScript, a server-side language such as PHP, and SQL.

It doesn't seem right that we need five languages for a single application, but that aside, the fact that SQL injection is still in the top ten reasons that anything gets compromised is pathetic. It is 2013; SQL injection shouldn't really exist anymore. SQL programming can also be a bit cumbersome, for multiple reasons. Enter the ORM. ORMs are designed for two purposes: making SQL data more easily accessible to a programmer from a chosen programming language, and improving overall application security. The problem I have with most ORMs is simple: I still find myself having to write some form of SQL-like statements, even if it isn't traditional SQL itself. For example, in PHP's Doctrine ORM, if I wanted to select an article by id, the syntax would look something like:

   $article = Doctrine_Query::Create()->select('*')->from('article')->where('id=?')->execute($id)->fetchOne();

The syntax may have changed since I last used Doctrine, but you can see there is still a lot of SQL-like code going on (even if it's not direct SQL itself). In this case I have to ask: why didn't we just use the MySQL PDO library? At this point, we've added a lot of extra bloat to the application in the form of the Doctrine ORM, yet we still find ourselves writing SQL (or something similar). For all of that code and RAM consumption, that's not much of an improvement for a developer who just wants to hack out a quick application.

So, I've made my own quick and dirty ORM (available at github). It automatically handles sanitizing for the developer, as well as the object mapping itself. Of course, this isn't the best ORM in the world (and I will never make that claim), but it certainly helps get some code out quickly and effectively. It's also very tiny. Many improvements can be made to its design, and I will continue to develop it off and on as needed for my own applications. The purpose is to effectively eliminate the need to write SQL during (simple) application development.

The ormclass needs a configuration file to be included before it. The configuration is expected to look like:

    $dbhost   = 'localhost';  //Database server hostname
    $database = '';           //Database name
    $dbuser   = '';           //Database username
    $dbpass   = '';           //Database password

    $dbl      = @mysql_connect($dbhost,$dbuser,$dbpass);
    @mysql_select_db($database,$dbl) or die("I'm not configured properly!");

Obviously, you'll have to fill those values in for yourself. I wanted an ORM that would let me do something like the following:

    $article  = new article($_GET['id']);
    # or 
    $article  = new article($_GET['title']);
    # or
    $articles = new article($array_of_ids);
    # or 
    $articles = new article($array_of_titles);
    # or 
    $articles = new article($nested_mixed_array_of_titles_and_ids);    

I also wanted to be able to simply assign properties to an object, then save or delete it, or create new objects outright. It would also need search capacity, both exact and wildcard. This would (mostly) eliminate the need to write actual SQL in my application, while also handling some of the sanitizing tedium for me. Again, I'm aware that this can certainly be done better; if you'd like to contribute to the project, submit a pull request on github. This is a quick and dirty implementation of such an ORM that allows the programmer some leeway to write logical code instead of tedious code. There are definitely some places that need work. I've hacked out a version that uses the traditional MySQL library, and I'm working on a version that uses the MySQL PDO library.

The library's methods each remove some subset of SQL query tedium. The following methods are inherited by all classes extending the ORM's base class:

  • __construct($arg = null)
  • search($property,$string,$limit = 10, $offset = 0)
  • search_exact($property,$value, $limit = 10, $offset = 0)
  • unsafe_attr($field,$value)
  • fetchAll()
  • fetchRecent($limit = 10)
  • delete()
  • save()

The constructor automatically checks whether a method called construct() exists in its child class. If so, it invokes that function after it has preloaded all of the relevant data into the object. This is how relations can be maintained. It's a bit hackier than most ORMs (there's no configuration file in which you simply declare the relations), but it gets the job done and gives the programmer control over whether or not relations are followed and child objects are created by default. The ORM requires that every table have an 'id' column; the 'name' column is optional. Here is an example relation:

    class article extends ormclass {
        function construct() {
            $this->author = new author($this->author_id);
        }
    }
With this relation in place, you could later do:

     $article = new article($id);
     echo $article->author->name; # or other author property.

When you want to create a new record, you can simply pass '0' as the ID for the object, and it will automatically have an ID on instantiation:

    $article = new article(0);

Alternatively, it's possible to just call save() after a null instantiation (you'd do this if you don't need the record to have an ID, for relation purposes, before the object has attributes):

    $article = new article();
    $article->save();

Similar to the constructor hook construct(), there is also a hook for the creation of a new record: add a function called creation() to the class, and it will be called any time a new record is inserted into the database.

The difference between unsafe_attr() and save() is relatively simple. If HTML is allowed in a field, for example $article->body, you'd want to use unsafe_attr() to save that particular field, because save() auto-sanitizes against XSS. When using unsafe_attr(), because this version uses the normal SQL library (and not PDO), you will need to make sure your HTML contains exclusively single quotes or exclusively double quotes; it doesn't particularly matter which. To prevent SQL injection, the function checks that you aren't using both, and returns false if both are in use. This bug is the primary reason I'm developing a PDO version separately (besides standards; we can't forget those).

This ORM also has a performance/feature trade-off. Because I wanted it to be able to handle nested arrays, the collection function runs an arbitrarily large number of SQL queries. I can provide a version that doesn't do this (but will also be unable to handle nested arrays) on request, since I'm sure people will not want the performance hit; however, because I am working on a PDO version, I'd rather make collection handling a loader option in that rendition. The current version also only auto-sanitizes strings and integers; better sanitizing will come in the PDO version (hence my describing this as "quick and dirty").

This ORM does not have any scaffolding, meaning you will have to create the database and the associated tables yourself before the ORM can access the data. It does not auto-generate tables or class files. If you have an existing database and you'd like to auto-generate the class files, something like the following line of bash should suffice:

mysql dbname -e 'show tables'|grep -v dbname|awk '{print "<?php\nclass "$0" extends ormclass {\n\n}\n?>"}' > objects.php

In closing, the point of this was simply to prove that SQL statements can be eliminated from the high-level code entirely, and to provide an easily accessible API. The PDO version should be able to handle a few more complex tasks, like table scans and complex joins that create meta-objects from multiple tables. I also plan to extend compatibility to PostgreSQL, and perhaps even port this to additional programming languages. At any rate, please enjoy your newfound ability to kick back and lazily write database-powered applications. Happy hacking.