Friday, January 31, 2014

Rethinking Some Measures

So, I've been reading more about how RSiena (and stochastic actor-oriented models) work, and I've been rethinking a few things.

The Problem

Basically, these models expect both the networks and the behavior to change only gradually between observations. For example, this paper says that:
A dynamic network consists of ties between actors that change over time. A foundational assumption of the models discussed in this paper is that the network ties are not brief events, but can be regarded as states with a tendency to endure over time. Many relations commonly studied in network analysis naturally satisfy this requirement of gradual change, such as friendship, trust, and cooperation. Other networks more strongly resemble ‘event data’, e.g., the set of all telephone calls among a group of actors at any given time point, or the set of all e-mails being sent at any given time point. While it is meaningful to interpret these networks as indicators of communication, it is not plausible to treat their ties as enduring states, although it often is possible to aggregate event intensity over a certain period and then view these aggregates as indicators of states.
Meanwhile, the RSiena manual says that:
In the models, behavioral variables can be binary or ordinal discrete (the extension for continuous behavioral data is currently being developed). The number of categories should be small (mostly 2 to 5; larger ranges are possible). In the case of behaviors, Stochastic Actor-Oriented Models express how actors increase, decrease, or maintain the level of their

Some Ideas 

My plan is to create 4 different types of networks - observation, local communication, global communication, and collaboration - and to use these to try to predict how many edits a user would make.

I'm realizing that the way I'm measuring these networks, and the way I'm measuring activity, doesn't match how SAOMs work. For example, I define "observing" another user as being the next editor of a page they edited. Someone who observes another user's work only once in a period is quite unlikely to observe that same user again, so this definitely represents an "event" rather than a "state".

I haven't decided for sure what to do. My tentative plan is to change my observation network into a dynamic dyadic covariate. As I understand it, this would still test whether or not observing someone else's page makes you more likely to be active.

I think that my other networks can be aggregated so that they represent states. That is, users who have communicated via talk pages at least X times in a period can be considered as having a "communicates with each other" link.
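Here is a rough sketch of what that aggregation could look like in Python. It is only an illustration: the event tuples, the user names, the `aggregate_ties` helper, and the choice to keep ties directed are all assumptions, and the threshold (`min_count`, the "X" above) is something I would still have to pick.

```python
from collections import Counter

def aggregate_ties(events, min_count):
    """Turn (sender, receiver, period) message events into per-period tie sets.

    A directed tie sender -> receiver exists in a period when the sender
    messaged the receiver at least `min_count` times in that period.
    """
    counts = Counter(events)  # (sender, receiver, period) -> number of events
    ties = {}
    for (sender, receiver, period), n in counts.items():
        if n >= min_count:
            ties.setdefault(period, set()).add((sender, receiver))
    return ties

# Hypothetical talk-page messages: (sender, receiver, period)
events = [
    ("ann", "bob", 1), ("ann", "bob", 1), ("ann", "bob", 1),
    ("bob", "ann", 1),
    ("ann", "cat", 2), ("ann", "cat", 2),
]
ties = aggregate_ties(events, min_count=2)
print(ties)
```

With a threshold of 2, the single reply from bob to ann is treated as an event and dropped, while the repeated messages become ties.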

Behavior has to change, too

I definitely need to change my behavior variable, too. As in most "peer production" environments, activity level follows a power-law distribution: a few users make lots of edits, and most users make only a few. That makes the variable tough to model. I tried using quartiles, but even that is messy with this sort of data.

For now, I'm thinking that I will use the number of days they were active (i.e., made an edit) in the period as a measure of activity. This has a clear max, and should have a much less skewed distribution. In addition, my research is situated in the theory of communities of practice, which says that as people join a community, it becomes part of not only what they do, but who they are. In this sense, I think that the days active measure represents how engaged someone is in the community, even if they aren't making lots of edits.
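A minimal sketch of the days-active measure, assuming edits are available as (user, date) pairs and that a period is a calendar month. The `days_active` and `to_level` helpers and the cutoffs are made up for illustration; the cutoffs only show one way to squeeze the count into the small number of ordinal categories the manual asks for.

```python
from collections import defaultdict
from datetime import date

def days_active(edits):
    """Count distinct calendar days with at least one edit, per (user, year, month).

    `edits` is an iterable of (user, date) pairs. Users with no edits in a
    month simply do not appear in the result.
    """
    seen = defaultdict(set)
    for user, day in edits:
        seen[(user, day.year, day.month)].add(day)
    return {key: len(days) for key, days in seen.items()}

def to_level(n_days, cutoffs=(0, 3, 10)):
    """Collapse a day count into an ordinal level 0..3 (cutoffs are arbitrary)."""
    return sum(n_days > c for c in cutoffs)

# Hypothetical edit log: two edits on the same day count as one active day.
edits = [
    ("ann", date(2014, 1, 3)), ("ann", date(2014, 1, 3)),
    ("ann", date(2014, 1, 20)), ("bob", date(2014, 1, 5)),
    ("ann", date(2014, 2, 1)),
]
activity = days_active(edits)
levels = {key: to_level(n) for key, n in activity.items()}
print(activity)
print(levels)
```

Because the count can never exceed the number of days in the period, a handful of very heavy editors can no longer stretch the scale the way raw edit counts do.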

P.S. For now, I decided to just keep my data as adjacency matrices, and not worry about converting them to edge lists. Maybe after I get my analysis done for the Sunbelt conference!

Saturday, January 25, 2014

Edgelists in RSiena

So, I've been storing my networks on my computer as full matrices. I have 3 different networks I'm looking at, with monthly snapshots, over 5 years, so there are a lot of files, and the full matrices are bulky.

In reading the RSiena manual, I saw that the native Siena format is an edgelist, in the format "senderID, receiverID, value, wave". I thought that converting my matrices to edgelists in that format would really reduce the size of the files (not to mention being easier to debug as needed).
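The conversion itself is simple. Here is a sketch in Python, assuming each wave is a square list-of-lists adjacency matrix and that actor IDs are just the 1-based row and column positions. The `matrix_to_edgelist` name and the tiny matrix are made up for illustration.

```python
def matrix_to_edgelist(matrix, wave):
    """Return (sender, receiver, value, wave) rows for the non-zero cells.

    Actor IDs are 1-based row/column positions, which is an assumption here.
    """
    rows = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                rows.append((i, j, value, wave))
    return rows

# Hypothetical 3-actor network for wave 1
wave1 = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
edges = matrix_to_edgelist(wave1, wave=1)
for edge in edges:
    print(*edge)  # space-separated: sender receiver value wave
```

Only the non-zero cells are written out, which is where the space savings come from for sparse networks.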

So, I've been trying to get RSiena to accept a very simple made-up edgelist that I created. It looks something like:

2  3  1  1
3  4  1  1
2  4  1  1
3  2  1  1
4  3  1  1
4  2  1  1

However, when I ran sienaDataCreateFromSession('../fakeEdgeList.txt') I got:

Error in if (session$Type[i] == "exogenous event") { :
   argument is of length zero

After reading help(sienaDataCreateFromSession) again, I realized that it's actually expecting a sort of config file, with the data file names listed inside it. So, I created my config file:

data1,observation,../fakeEdgeList.txt,Siena,,1 2,network,Yes,.,1,6

I then got:
Error: could not find function "network.size"

I realized I needed to install the 'network' package.

However, I'm still not there. After installing network, I get:

Error in network.size(namefiles[[1]]) :
  network.size requires an argument of class network.

So, for some reason, my file isn't being converted to a network object, but I'm going to have to wait until Monday to try to figure out why. :)

An Introduction


I am a Master's student at Purdue, working on my thesis project. For my project, I am looking at the edits made in the online genealogy community WeRelate, from the framework of Social Network Analysis.

My overall theory is that people in communities (like WeRelate) move through different patterns of behavior. For example, at first they are novices - they work mostly alone, and don't do too much. As they learn more about the community and the technology, they start to collaborate more, do more work, and do specialized work.

My goal is to use machine learning (specifically clustering) to identify different "behavioral signatures", and then to track how people move through those behavioral patterns. Specifically, I want to test how interactions with others affect how/whether/when people change their behavioral patterns.

Purpose of this Blog

In order to study this, I am using a number of tools that I have used only casually before, namely PostgreSQL, R, and RSiena. While I have quite a lot of experience writing Python scripts to do data manipulation, the scale of this data is much larger than anything I've used before. There are 15.5 million tracked edits, stored in a giant XML document. I initially tried to manipulate the document directly with Python, but after far too much time spent waiting for my scripts to go through that giant file, I realized that solution wouldn't work, and I moved things into a PostgreSQL database.

I have a few goals for this blog:

  1. To write down what I'm working on, which will hopefully motivate me to keep going.
  2. To provide a resource for others who are trying to do similar things. For RSiena in particular, it's been very tough to find beginner-level resources.
  3. Ideally, to find some people who can give me advice and suggestions.
  4. To prove to my wife (and committee) how hard I'm working. :)

So, for the most part, my posts will be quite technical, outlining what I'm working on, and what I (have or haven't) learned.