An Ego-Centric Researcher: 2014

Friday, October 17, 2014

Creating LaTeX PDF from included file

I learned a neat trick today, one of those things that, once you learn it, makes you wonder how you went so long without it.

I've been working on revising my thesis to try to publish it as a few papers. Because it's such a long paper, I have a 'thesis.tex' file, which is basically a bunch of include statements, to include each of the chapters. As I'm editing it, I open each chapter in a tab in vim, and until today, whenever I wanted to recompile the PDF, I would move to the 'thesis.tex' tab and recompile from there.

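So, today I found out that you can set up a .tex file as a "main" file, so that when you hit '\ll' to compile, vim-latex compiles that file instead of the file you're currently looking at. As I understand the instructions linked below, it comes down to creating an empty file named 'thesis.tex.latexmain' in the same directory as 'thesis.tex'.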

Instructions:
http://vim-latex.sourceforge.net/documentation/latex-suite/latex-project.html#latex-project-example

Friday, September 5, 2014

Importing Edgelists into RSiena

So, importing data into RSiena is a bit of a pain. The GUI has some support for importing Pajek files, for example, but I've been working mostly from the command line, and with .R files, which are what the manual covers.

For my current project, I have CSV files in a very common edgelist format, something like -

sourceID,receiverID,weight,wave

I think it should be simple to import these into RSiena, but it isn't.

RSiena accepts either adjacency matrices (matrices with a 0 or 1 in each cell, one cell for each ordered pair of nodes) or sparse matrices. Sparse matrices are similar to edgelists, but they have to be in the dgTMatrix class. As you can tell by reading the documentation, it's not exactly obvious how to get the data into that format.

I started by trying the Matrix() function, then I found the sparseMatrix() function. I realized that weight didn't matter, so I simply ignored the weight column. This creates a sparse matrix of the type "ngCMatrix", which is a "pattern matrix", and can't be coerced to a dgTMatrix.

So, eventually, I ended up creating a new weight column with everything set to 1. Because sparseMatrix() adds up the values of duplicate entries, any cell that ends up greater than 1 is reset to 1 afterwards.

My current code is below:

library(Matrix)
library(RSiena)

edgeListToAdj <- function(x, waveID, nodeCount) {
    # Keep only rows from the current wave whose nominee is a valid node
    # (NomineeID == 0 means the respondent is not connected to anyone).
    tempNet <- x[x$NomineeID > 0 & x$NomineeID <= nodeCount & x$Wave == waveID, ]
    # Create a binary column for weights (since RSiena doesn't use weights).
    tempNet$Weight <- 1
    # Convert the edgelist to a sparse adjacency matrix. Rows are NomineeID and
    # columns are RespondentID; RSiena reads rows as senders, so swap the two
    # if respondents are the senders in your data.
    adjacencyMat <- sparseMatrix(i = tempNet$NomineeID, j = tempNet$RespondentID,
                                 x = tempNet$Weight, dims = c(nodeCount, nodeCount))
    # sparseMatrix() sums duplicate entries, so re-binarize them.
    # Yes, binarize is a real word.
    adjacencyMat[adjacencyMat > 1] <- 1
    # Convert to triplet form (a dgTMatrix), since this is what RSiena expects
    return(as(adjacencyMat, "TsparseMatrix"))
}

createNetwork <- function(fileName, numWaves, nodeCount) {
    print(fileName)
    # Read the CSV file into a data frame
    netDF <- read.csv(fileName)
    # Create a list of adjacency matrices, one per wave
    net <- lapply(1:numWaves, function(w) edgeListToAdj(netDF, w, nodeCount))
    # Change this into an RSiena network
    RSienaObj <- sienaDependent(net)
    return(RSienaObj)
}
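To see why the re-binarizing step is needed, here is a minimal sketch on a made-up edgelist (the data, the column names sourceID/receiverID, and the variable names are all assumptions for illustration, not my real data). It only needs the Matrix package, and it puts senders in the rows.

```r
library(Matrix)

# Hypothetical toy edgelist: the tie 1 -> 2 appears twice in the same wave
edges <- data.frame(
    sourceID   = c(1, 1, 2, 3),
    receiverID = c(2, 2, 3, 1)
)
nodeCount <- 3

# Rows are senders, columns are receivers; every tie gets weight 1
m <- sparseMatrix(i = edges$sourceID, j = edges$receiverID,
                  x = rep(1, nrow(edges)), dims = c(nodeCount, nodeCount))

# sparseMatrix() adds up the weights of duplicated (i, j) pairs
dupValue <- m[1, 2]

# Re-binarize, then convert to triplet form
m[m > 1] <- 1
mT <- as(m, "TsparseMatrix")
print(class(mT))
```

Printing dupValue before the re-binarizing step shows the summed weight of the duplicated tie.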

Tuesday, July 15, 2014

Thesis Accepted

My thesis is defended, edited, and accepted.

It's online at
https://www.academia.edu/7544796/ONLINE_NATURALIZATION_EVOLVING_ROLES_IN_ONLINE_KNOWLEDGE_PRODUCTION_COMMUNITIES

I'm pretty proud of how it turned out. I didn't get all of the results I was hoping for, but that's research.

Thursday, July 3, 2014

Thesis Defended!

I haven't kept up on this blog as much as I should have, but my thesis is done and defended!

Saturday, April 12, 2014

An Analysis of Interactions on the Boston Subway

I made my first-ever visit to Boston this past week, and while I was there, I was able to put together a small research project. On a subway ride, I carefully noted each of the interactions between the 50 riders in the train car I was in, during a 15-minute ride. Edge weight represents the number of minutes spent talking to each other.

Nodes are colored based on degree centrality, sized based on eigenvector centrality, and spaced with Fruchterman-Reingold in Gephi.

So far, the research is merely descriptive, but I think some real insights could be made through running an actor-oriented model.

Thursday, April 3, 2014

.vimrc and Dropbox

If you have used vim for very long, you have almost certainly made some modifications to your .vimrc file - this is the file that stores configurations for how vim does things like tabs, syntax highlighting, etc.

If you use more than one computer, I highly recommend keeping your .vimrc file on the cloud. It's incredibly simple, and provides for a consistent experience across computers.

This Stack Overflow post gives simple instructions on how to do this.

Saturday, March 29, 2014

Roles Visualization

So, thanks to some help from the very kind BrodieG on Stack Overflow, I was finally able to get some visualizations of the way that roles change over time. I am using the following code (as you can see, I tried to learn how to do melting and reshaping, then kind of gave up that hope - maybe another time).

library(reshape2)
library(ggplot2)

#clusters.mlt <- melt(clusters, id.vars="id")
#clusters.agg <- aggregate(. ~ id + variable, clusters.mlt, sum)

# The minimum number of times a user has to be in a given group in order to
# be shown in the graph for that group
minMonths <- 2

makeGraph <- function(clusters) {
    # Drop the first column so it isn't counted as a month (this assumes the
    # first column is the user ID, as in x[2:76] below)
    months <- clusters[, -1]
    # Count how many users are in each cluster in each month
    clus1 <- apply(months, 2, function(x) sum(x == '1', na.rm=TRUE))
    clus2 <- apply(months, 2, function(x) sum(x == '2', na.rm=TRUE))
    clus3 <- apply(months, 2, function(x) sum(x == '3', na.rm=TRUE))
    clus0 <- apply(months, 2, function(x) sum(x == '0', na.rm=TRUE))
    clusters2 <- data.frame(clus0, clus1, clus2, clus3)
    # Transpose so that each row is a cluster and each column is a month
    c3 <- as.data.frame(t(clusters2))
    c3$id <- c('Low Activity Cluster', 'Cluster 1', 'Cluster 2', 'Cluster 3')
    c3 <- c3[order(c3$id), ]
    return(ggplot(melt(c3, id.vars="id")) +
        geom_area(aes(x=variable, y=value, fill=id, group=id), position="fill"))
}
#print(ggplot(clusters.mlt) +
#  stat_summary(aes(x=variable, y=value, fill=id, group=id), fun.y=sum, position="fill", geom="area"))

# Stats for just those who were in each group

clusterDF <- read.csv('clustersByID.csv')
ggsave(filename="../Results/allUsers.png", plot=makeGraph(clusterDF))
# Keep users who were in Role 1 in at least minMonths of their months (columns 2 to 76)
cl1 <- clusterDF[apply(clusterDF, 1, function(x) sum(x[2:76] == "1", na.rm=TRUE) >= minMonths), ]
ggsave("../Results/Role1_2+.png", makeGraph(cl1))

And this is what the code produces (some new colors would probably be a good thing to work on next!)

There are some interesting things going on here, but no clear movement into the central-type role (Role 1).

Pasting text in Vim

So, I use Vim to do all of my programming, as well as for writing my thesis. Despite how often I use Vim, I'm not really a very expert user (as you will soon learn).

So, it's not uncommon that I will want to copy and paste a code snippet from a website (usually StackOverflow), either to use it in my own code, or to figure out how it works.

When you paste it into vim, it really looks terrible. Everything is indented like crazy, and it's almost unusable. When it's a small snippet, it doesn't take long to fix it, but for longer pieces of code, it's a huge pain.

I always thought that it was just some poor programming on StackOverflow - that they put in a bunch of hidden tabs or something. But, I came across this article today, and realized that the problem is actually how Vim handles the pasted text.

Basically, before you want to paste a hunk of indented text from somewhere else, run

:set paste

Then, when you are done pasting to your heart's content, run

:set nopaste

and life will be good.

Thursday, March 20, 2014

Visualizing Changing Roles over time

So, my main research question is about how people move through various roles in a community over time.

In the end, I will be using social network analysis as a driver for why people move through roles, but I will start with identifying what the roles are, and some summary statistics of how people move through them.

In order to identify roles, I created monthly activity snapshots for each user, and then used a clustering algorithm in R to automatically identify different "behavioral roles". There is some evidence that the data don't cluster cleanly, but clustering isn't a central component of my research, so I am moving forward anyway.

I decided to use the k-medoids (aka "partitioning around medoids" or "pam") algorithm in R. I used the silhouette function to identify the best k (which was 3 clusters).
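As a rough sketch of that k-selection step (not the actual thesis script): the toy data below is made up, and candidateK, avgWidth, and bestK are just illustrative names. pam() and the average silhouette width it stores in silinfo$avg.width come from the cluster package.

```r
library(cluster)

set.seed(42)
# Hypothetical stand-in for the monthly activity snapshots:
# three well-separated blobs of 20 points each, in two dimensions
activity <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
    matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Run pam() for each candidate k and record the average silhouette width
candidateK <- 2:6
avgWidth <- sapply(candidateK, function(k) pam(activity, k)$silinfo$avg.width)

# Pick the k with the highest average silhouette width and refit
bestK <- candidateK[which.max(avgWidth)]
fit <- pam(activity, bestK)
print(table(fit$clustering))
```

On real activity data the silhouette widths will be much closer together than on these toy blobs, which fits with the clusters not separating cleanly.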

The data I used to create the clusters only included those months where a user made at least 5 edits. So, I created a 4th cluster to represent months where a user made fewer than 5 edits, and used a Python program to add the cluster results to the original stats file.

I then wrote another Python script to rearrange this data, so that it is in the format

ID   Month1    Month2   Month3 ...
1    1         2        2
2    1         0        2

...

so that Month1 is the user's role in their first month, Month2 the role in that user's 2nd month, etc.

For way, way too long today I've been trying to get R to display this data as a stacked area graph of the ratio of roles by month. It's been a huge pain to try to figure out how to reshape it, etc.

Once I can get that, I want to compare that graph to the graph of those who were in each role at least X times during their tenure.
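For what it's worth, the numbers behind that kind of graph can be worked out in base R before any plotting. This is only a sketch on made-up data in the ID / Month1 / Month2 layout above; roles, roleLevels, shares, and minMonths are assumed names for illustration.

```r
# Hypothetical toy data: role of each user in each of their months
roles <- data.frame(
    ID     = c(1, 2, 3, 4),
    Month1 = c(1, 1, 0, 2),
    Month2 = c(2, 0, 0, 2),
    Month3 = c(2, 2, NA, 3)   # NA: user 3 has no third month
)
roleLevels <- 0:3

# Count users in each role per month (rows = roles, columns = months)
counts <- sapply(roles[, -1], function(m) table(factor(m, levels = roleLevels)))

# Turn the counts into within-month proportions (each column sums to 1)
shares <- prop.table(counts, margin = 2)

# Long format: one row per (role, month) pair, which is what ggplot wants
long <- as.data.frame(as.table(shares), responseName = "share")
names(long)[1:2] <- c("role", "month")

# Users who were in role 2 at least minMonths times during their tenure
minMonths <- 2
inRole2 <- roles[rowSums(roles[, -1] == 2, na.rm = TRUE) >= minMonths, ]
print(shares)
```

The long data frame could then go to ggplot with geom_area, and the same steps could be rerun on inRole2 to get the comparison graph.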

Friday, March 14, 2014

Sunbelt

I had a very good time at Sunbelt.

I finally got a model that would converge in RSiena. My plan was to start with a simple model, and then to add more of my hypotheses once I got the simple model working. It turned out to be tough enough to get the simple model to converge that I just went with that once I had it working.

At Sunbelt, I went to a great workshop by Tom Snijders, which was very helpful. He also came to the poster session, and gave me some great tips about how to work with my model.

Friday, January 31, 2014

Rethinking Some Measures

So, I've been reading more about how RSiena (and stochastic actor-oriented models) work, and I've been rethinking a few things.

The Problem

Basically, they are expecting both networks and behavior to be fairly similar between observations. For example, this paper says that

A dynamic network consists of ties between actors that change over time. A foundational assumption of the models discussed in this paper is that the network ties are not brief events, but can be regarded as states with a tendency to endure over time. Many relations commonly studied in network analysis naturally satisfy this requirement of gradual change, such as friendship, trust, and cooperation. Other networks more strongly resemble ‘event data’, e.g., the set of all telephone calls among a group of actors at any given time point, or the set of all e-mails being sent at any given time point. While it is meaningful to interpret these networks as indicators of communication, it is not plausible to treat their ties as enduring states, although it often is possible to aggregate event intensity over a certain period and then view these aggregates as indicators of states.

And the RSiena manual says that:

In the models, behavioral variables can be binary or ordinal discrete (the extension for continuous behavioral data is currently being developed). The number of categories should be small (mostly 2 to 5; larger ranges are possible). In the case of behaviors, Stochastic Actor-Oriented Models express how actors increase, decrease, or maintain the level of their behavior.

Some Ideas

My plan is to create 4 different types of networks - observation, local communication, global communication, and collaboration - and to use these to try to predict how many edits a user would make.

I'm realizing that the way I'm measuring these networks, and the way I'm measuring activity, doesn't match the way that SAOMs work. For example, someone who observes the work of another user only one time in a period (defined as being the next editor of a page) is quite unlikely to observe the same user again. That is, this definitely represents an "event" instead of a "state".

I haven't decided for sure what to do. My tentative plan is to change my observation network into a dynamic dyadic covariate. As I understand it, this would still test whether or not observing someone else's page makes you more likely to be active.

I think that my other networks can be aggregated, and be made to represent states. That is, users who have communicated via talk pages X times in a period can be considered as having a "communicates with each other" link.
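A minimal sketch of that aggregation, assuming the events for one period have already been counted into a sender-by-receiver matrix. The counts and the threshold minMessages (the "X" above) are made up for illustration.

```r
# Hypothetical counts of talk-page messages between 4 users in one period
# (rows = senders, columns = receivers)
messageCounts <- matrix(c(0, 5, 1, 0,
                          4, 0, 0, 0,
                          0, 2, 0, 7,
                          0, 0, 3, 0), nrow = 4, byrow = TRUE)
minMessages <- 3  # the "X" threshold; an arbitrary value for illustration

# Directed version: i -> j is a tie if i sent j at least minMessages messages
directedTie <- (messageCounts >= minMessages) * 1

# Undirected "communicates with each other" version: add up the messages
# in both directions before applying the threshold
totalMessages <- messageCounts + t(messageCounts)
mutualTie <- (totalMessages >= minMessages) * 1
diag(mutualTie) <- 0
print(mutualTie)
```

Whether the directed or the undirected version is the right "state" depends on the hypothesis; the threshold itself is a judgment call.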

Behavior has to change, too

I definitely need to change my behavior variable, too. Like in most "peer production" environments, activity level follows a power-law curve, with a few users making lots of edits, and most users making a few. This makes this variable tough to model. I tried using quartiles, but even that is messy with this sort of data.

For now, I'm thinking that I will use the number of days they were active (i.e., made an edit) in the period as a measure of activity. This has a clear max, and should have a much less skewed distribution. In addition, my research is situated in the theory of communities of practice, which says that as people join a community, it becomes part of not only what they do, but who they are. In this sense, I think that the days active measure represents how engaged someone is in the community, even if they aren't making lots of edits.
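Here is a small sketch of how the days-active measure could be computed, assuming an edit log with one row per edit. The edits data frame, its column names, and the use of calendar months as periods are all assumptions for illustration.

```r
# Hypothetical edit log: one row per edit
edits <- data.frame(
    userID = c(1, 1, 1, 2, 2, 3),
    timestamp = as.POSIXct(c("2014-01-03 09:00", "2014-01-03 17:30",
                             "2014-01-20 12:00", "2014-01-05 08:00",
                             "2014-02-01 10:00", "2014-02-14 21:00"), tz = "UTC")
)

# Calendar day and period (month) of each edit, as text
edits$day <- format(edits$timestamp, "%Y-%m-%d")
edits$period <- substr(edits$day, 1, 7)

# Count the distinct days with at least one edit, per user and period
daysActive <- aggregate(day ~ userID + period, data = edits,
                        FUN = function(d) length(unique(d)))
names(daysActive)[3] <- "daysActive"
print(daysActive)
```

Two edits on the same day count once, so a burst of edits in one sitting doesn't inflate the measure.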

P.S. For now, I decided to just keep my data as adjacency matrices, and not worry about converting it to edge lists. Maybe after I get my analysis done for the Sunbelt conference!

Saturday, January 25, 2014

Edgelists in RSiena

So, I've been storing my networks on my computer as full matrices. I have 3 different networks I'm looking at, with monthly snapshots, over 5 years, so there are a lot of files, and the full matrices are bulky.

In reading the RSiena manual, I saw that the native Siena format is an edgelist, in the format "senderID, receiverID, value, wave". I thought that converting my matrices to edgelists in that format would really reduce the size of the files (not to mention being easier to debug as needed).

So, I've been trying to get RSiena to accept a very simple made-up edgelist that I created. It looks something like:

2 3 1 1
3 4 1 1
2 4 1 1
3 2 1 1
4 3 1 1
4 2 1 1

However, when I ran sienaDataCreateFromSession('../fakeEdgeList.txt') I got:

Error in if (session$Type[i] == "exogenous event") { :
argument is of length zero

After re-reading help(sienaDataCreateFromSession), I realized that it's actually expecting a sort of config file, and the data file names are listed there. So, I created my config file.

Group,Name,Filename,Format,Period,ActorSet,Type,Selected,MissingValues,NonZeroCode,NbrOfActors
data1,observation,../fakeEdgeList.txt,Siena,,1 2,network,Yes,.,1,6

I then got
Error: could not find function "network.size"

I realized I needed to install the 'network' package.

However, I'm still not there. After installing network, I get

Error in network.size(namefiles[[1]]) :
network.size requires an argument of class network.

So, for some reason, my file isn't being converted to a network object, but I'm going to have to wait until Monday to try to figure out why. :)
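For reference, the matrix-to-edgelist conversion itself is the easy part. This is a sketch in base R on a made-up 4-actor matrix; matrixToEdgelist is a hypothetical helper, not an RSiena function, and it only produces the "senderID receiverID value wave" rows without saying anything about what RSiena will accept.

```r
# Hypothetical 4-actor network for one wave, stored as a full matrix
adj <- matrix(c(0, 1, 0, 0,
                0, 0, 1, 1,
                1, 0, 0, 0,
                0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Hypothetical helper: one edgelist row per non-zero cell of the matrix
matrixToEdgelist <- function(mat, wave) {
    idx <- which(mat != 0, arr.ind = TRUE)
    data.frame(senderID = idx[, "row"], receiverID = idx[, "col"],
               value = mat[idx], wave = rep(wave, nrow(idx)))
}

edgelist <- matrixToEdgelist(adj, wave = 1)

# Write space-separated rows with no header, like the made-up file above
outFile <- tempfile(fileext = ".txt")
write.table(edgelist, outFile, row.names = FALSE, col.names = FALSE)
print(edgelist)
```

With monthly snapshots, the same helper could be applied to each month's matrix and the results stacked with rbind().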

An Introduction

Background

I am a Master's student at Purdue, working on my thesis project. For my project, I am looking at the edits made in the online genealogy community WeRelate, from the framework of Social Network Analysis.

My overall theory is that people in communities (like WeRelate) move through different patterns of behavior. For example, at first they are novices - they work mostly alone, and don't do too much. As they learn more about the community and the technology, they start to collaborate more, do more work, and do specialized work.

My goal is to use machine learning (specifically clustering) to identify different "behavioral signatures", and then to track how people move through those behavioral patterns. Specifically, I want to test how interactions with others affect how/whether/when people change their behavioral patterns.

Purpose of this Blog

In order to study this, I am using a number of tools that I have used only casually before: PostgreSQL, R, and RSiena. While I have quite a lot of experience writing Python scripts to do data manipulation, the scale of this data is much larger than anything I've used before. There are 15.5 million edits that are tracked, stored in a giant XML document. I initially tried to manipulate the document directly with Python, but after far too much time spent waiting for my scripts to go through that giant file, I realized that solution wouldn't work, and I moved things into a PostgreSQL database.

I have a few goals for this blog:

To write down what I'm working on, which will hopefully motivate me to keep going.
To provide a resource for others who are trying to do similar things. For RSiena in particular, it's been very tough to find beginner-level resources.
Ideally, to find some people who can give me advice and suggestions.
To prove to my wife (and committee) how hard I'm working. :)

So, for the most part, my posts will be quite technical, outlining what I'm working on, and what I (have or haven't) learned.