Outreachy Progress: 2019-03

Summary of work:

  • Approved 1,013 initial applications total
  • Investigated an issue with KiwiIRC being banned on some IRC servers
  • Fixed some IRC link issues in the Outreachy website
  • Promoted new projects on Twitter and the Outreachy announce list
  • Sent semi-automated emails to remind applicants of the final application deadlines
  • Answered applicants’ questions, as time allowed
  • Communicated with potential Outreachy sponsors
  • Chased down some outstanding invoices from the December 2018 round
  • Communicated with December 2018 interns who had internship extensions

Outreachy Progress: 2019-02

Summary of work this month:

  • Created final feedback form for interns and mentors
  • Contacted potential communities for the May to August 2019 round
  • Updated questions on the initial application form
  • Updated the website to the latest stable version of Django 1.11
  • Wrote a blog post announcing changes in eligibility criteria
  • Promotion on Twitter, emails to diversity in tech groups, and job board postings
  • Reviewed 874 initial application essays

The Outreachy internship program opened applications for the May to August round. Most of my time this month has been spent reviewing the 1,235 initial applications that have been submitted.

We’re definitely getting more applications this round. After the six week application period for the December to March round, we processed 1,817 initial applications. Less than two weeks into this round, we’ve had 1,235 initial applications submitted.

That sounds like a huge number, but that’s where the magic of Django comes in. Django allows us to collect time commitment information from all the applicants. We create a calendar of their time commitments and then see if they have 49 consecutive days free from full-time commitments during the internship period.
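
Here’s a minimal sketch of the kind of availability check that makes this scale (not the actual Outreachy code; the function and data layout are made up for illustration):

```python
from datetime import date, timedelta

def has_free_period(internship_start, internship_end, busy_ranges, days_needed=49):
    """Return True if there are `days_needed` consecutive days inside the
    internship period with no full-time commitment. `busy_ranges` is a list
    of (start, end) date tuples built from the applicant's time commitments."""
    day = internship_start
    free_run = 0
    while day <= internship_end:
        busy = any(start <= day <= end for start, end in busy_ranges)
        free_run = 0 if busy else free_run + 1
        if free_run >= days_needed:
            return True
        day += timedelta(days=1)
    return False

# Example: a school term that overlaps the first month of the internship.
print(has_free_period(date(2019, 5, 20), date(2019, 8, 20),
                      [(date(2019, 2, 1), date(2019, 6, 15))]))  # True
```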

So far, about 181 initial applications have been rejected because applicants had full-time commitments. (The number is usually higher in the December round because students in the northern hemisphere have a shorter break.)

We also check whether people are eligible to work in the countries they’re living in, whether people have participated in Outreachy or Google Summer of Code before, etc. There are 72 applications that were automatically denied because of those kinds of issues.

That leaves 982 applicants who were eligible for Outreachy so far. And we have to manually review every single applicant essay to see whether supporting this person would align with Outreachy’s program goal to support marginalized people in tech.

We ask specific essay questions to determine whether the applicant is underrepresented. We ask two more essay questions to determine whether they face discrimination or systemic bias in their learning environment or when looking for employment opportunities. Applicants have to demonstrate both characteristics. They have to be underrepresented *and* face discrimination.

It’s quite frankly difficult to spend 5-9 hours a day reading about the discrimination people face. We ask for personal stories, and people open up with some real horror stories. It’s probably re-traumatizing for them. It certainly impacts my mental health. Other people share less specific experiences with discrimination, which is also fine.

Sometimes reading essays introduces me to types of discrimination that are unfamiliar to me. For example, I’ve been reading more about the caste system in India and ethnic/tribal discrimination in Africa. Reading the essays can be a learning experience for me, and I’m glad we have multiple application reviewers from around the world.

One of the hardest things to do is to say no to an initial application.

Sometimes it’s clear from an essay that someone is from a group underrepresented in the technology industry of their country, but their learning environment is supportive and diverse, and they don’t think they’ll face discrimination in the workplace. Outreachy has to prioritize supporting marginalized people in tech, even if that means turning down underrepresented people who have the privilege to not face discrimination.

It’s also difficult because a lot of applicants who aren’t from groups underrepresented in tech equate hardship with discrimination. For example, a man being turned down for a job because he doesn’t have enough technical experience could be considered hardship. Interviewers assuming a woman doesn’t have technical experience because she’s a woman is discrimination. The end result is the same (you don’t get the job because the interviewer thinks you don’t have technical experience), but the cause (sexism) is different.

Sometimes systemic issues are at play. For example, not having access to your college’s library because you use a mobility device and there’s no elevator is both discrimination and a systemic issue. Some communities face gender-based violence against women. The violence means parents don’t allow women to travel away to college, and some universities restrict women to their dorms in the evenings. Imagine not being able to study after class, or not having internet in your dorm to do research. The reaction to these systemic issues incorrectly punishes the people who are most likely to face harassment.

It’s frustrating to read about discrimination, but I hope that working with Outreachy mentors gives people an opportunity they wouldn’t otherwise have.

Outreachy Progress: 2019-01

Summary

  • Finished cleaning up the technical debt that kept us from having two Outreachy rounds active at once
  • Added code for gathering internship midpoint feedback
  • Migrated the travel stipend page off the old wiki for Outreachy to the Django website
  • Added a required field for mentors to provide the minimum computer system requirements to contribute to the project
  • Created intern blog post prompts for weeks 5 & 7
  • Followed up on all December 2018 sponsorship invoices

Minimum System Requirements

New for this Outreachy round is asking mentors to provide the minimum system requirements for their project. Many Outreachy applicants have second-hand, 10-year-old systems. They may not have enough memory to run a virtualized development environment. In the past, we’ve had applicants who tried to follow installation instructions to complete their required contribution, only to have their systems hang.

By requiring mentors to provide minimum system requirements for their projects, we hope to help applicants who can’t afford a newer computer. We also hope that it will help communities think about how they can lower their technology barriers for applicants who face socioeconomic hardship.

Simplifying Language

This month I migrated the travel stipend instructions page from our old wiki to the new travel page. During that migration, I noticed the language on the page was filled with complex vocabulary and long sentences. That’s how I tend to write, but it’s harder for people who speak English as a second language to read.

I used the Hemingway editor to cut down on complex sentences. I would recommend that people look at similar tools to simplify the language on their own websites.

Debt, debt, and more technical debt

I had hoped that January would be spent contacting Outreachy communities to notify them of the round. Unfortunately, Outreachy website work took priority, because the site wasn’t ready for us to accept community sign-ups.

Most of the work was done on cleaning up the technical debt I talked about in my last blog post. The website has to handle having two internship rounds active at once. For example, in January, mentors were submitting feedback for the December 2018 internships, while other mentors were submitting projects for the upcoming May 2019 internships.

A lot of the process was deciding how long to display information on the website. For example, when should mentors be able to choose an applicant as an intern for their project?

Mentors could find a potential candidate very early in the application period, so the very soonest they could choose an intern would be when the application period starts.

Most people might assume that interns can’t be selected after we announce the internships. However, in the past, interns have decided not to participate, so mentors have needed to select another applicant after the interns are announced. The very latest they could select an intern would be five weeks after the internships start, since we can’t extend an internship for more than five weeks.

It’s a complex process to decide these dates. It requires a lot of tribal knowledge of how the Outreachy internship processes work. I’m happy to finally document some of those assumptions into the Outreachy website code.
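
As a small example, the intern selection window described above boils down to a couple of computed dates (the names and dates here are hypothetical, not the actual Outreachy code):

```python
from datetime import date, timedelta

def intern_selection_window(application_period_opens, internship_start):
    """Mentors can select an intern as soon as the application period opens,
    and as late as five weeks after the internship starts, since internships
    can't be extended by more than five weeks."""
    earliest = application_period_opens
    latest = internship_start + timedelta(weeks=5)
    return earliest, latest

# For the May 2019 round (the internship start date here is illustrative):
print(intern_selection_window(date(2019, 2, 18), date(2019, 5, 20)))
```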

Outreachy Progress: 2018-12

One of my resolutions for 2019 is to be more transparent about the work I’ve been doing for Outreachy. Hopefully (fingers crossed) this means you’ll be seeing a blog post once a month.

I’ll also throw in a selfie per month. My face is changing since I’ve been on hormone replacement therapy (testosterone) for about 7 months now. I started to get some peach fuzz around month 5. It’s still patchy, but I’m growing it out anyway so I can see if I can get a beard!

New glasses too!

What is Outreachy?

Outreachy is a three-month internship program. It’s completely remote (both interns and mentors come from around the world). We pay the interns a $5,500 USD stipend for the three months, plus a $500 travel stipend to attend a conference or event related to their internship or free software.

The goal of the internship is to introduce people to free and open source software. Outreachy has projects that involve programming, documentation, graphic design, user experience, user advocacy, and data science.

Outreachy’s other goal is to support people from groups underrepresented in the technology industry. We expressly invite women (both cis and trans), trans men, and genderqueer people to apply. We also expressly invite applications from residents and nationals of the United States of any gender who are Black/African American, Hispanic/Latin@, Native American/American Indian, Alaska Native, Native Hawaiian, or Pacific Islander. Anyone who faces under-representation, systemic bias, or discrimination in the technology industry of their country is invited to apply.

What’s My Role?

I own Otter Tech LLC, which is a diversity and inclusion consulting company. It’s been my full-time job since July 2016. I work with clients (mostly in the technology or free software space) that want to improve their culture and better support people from groups underrepresented in tech. Outreachy is one of my clients.

I am one of five Outreachy organizers. Two of us (Marina Zhurakhinskaya and I) are heavily involved in running the internship application process. Karen Sandler is great at finding funding for us. The whole Outreachy organizers team (including Tony Sebro and Cindy Pallares-Quezada) makes important decisions about the direction of the program.

Outreachy also recently hired two part-time staff members. They’ve been helping Outreachy applicants during the application period, and then also helping Outreachy interns when the internship is running. We don’t have a good name for their role yet, but we’ve sort of settled on “Outreachy Helpers”.

December 2018 Progress

The December 2018 to March 2019 internship round kicked off on December 4. Usually that’s downtime for me as an Outreachy organizer, because mentors and coordinators step up to interact with their interns. In the past, the only real interaction the Outreachy organizers had with interns was if their mentor indicated they were having issues (yikes!). This month was spent increasing the frequency and types of check-ins with interns and mentors.

Outreachy Chat Server

This round, we’re trying something new to have the Outreachy interns talk with Outreachy organizers and with each other. We’ve set up a private invitation-only Zulip chat server, and invited all the Outreachy organizers, interns, mentors, and coordinators. I’ve been doing a bit of community management, participating in discussions, and answering questions that Outreachy interns have as they start their internship. I also ran a text-based discussion and then a video chat for Outreachy interns to do a second week check-in.

I think the Outreachy Zulip chat has worked out well! I see interns connecting across different free software communities, and mentors from other communities helping different interns. Zulip has the concept of “streams” which are basically chat rooms. We have a couple of different streams, like a general chat channel and a channel for asking questions about Outreachy internship procedures. I’m fairly certain that I got more questions on the Zulip chat from interns than we ever got by using email and IRC.

Frequent Feedback

The other thing we’re doing this round is collecting feedback in a different way. In the past, we collected it at two points during the internship. The midpoint was at 6 weeks in and the final feedback was at 12 weeks in. However, this round, we’re collecting it at three points: initial feedback at 2 weeks in, midpoint feedback at 8 weeks in, and final feedback at 12 weeks.

Collecting feedback three times meant more overhead for evaluating feedback and sending the results to our fiscal sponsor, the Software Freedom Conservancy. I wrote code in December to allow the Outreachy internship website to collect feedback from mentors as to whether interns should be paid their initial stipend.
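
A rough sketch of what that feedback collection might look like as a Django model (the field names here are hypothetical, not the actual Outreachy schema):

```python
from django.db import models

class InitialMentorFeedback(models.Model):
    # Hypothetical fields for illustration, not the real Outreachy models.
    intern_name = models.CharField(max_length=255)
    in_contact = models.BooleanField(
        help_text="Has your intern been in regular contact with you?")
    progress_report = models.TextField(
        help_text="How is your intern progressing on the project?")
    payment_approved = models.BooleanField(
        help_text="Should your intern be paid the initial stipend?")

    def summary_for_conservancy(self):
        """One-line summary to forward to the fiscal sponsor."""
        status = "approved" if self.payment_approved else "on hold"
        return "{}: initial stipend {}".format(self.intern_name, status)
```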

We’re also collecting different feedback this round. I’m collecting feedback from both interns and mentors, based on a suggestion from a former Outreachy intern. Interns and mentors are asked the same questions, like “How long does it take (you/your intern) to respond to questions or feedback?” and “How long does it take (your mentor/you) to respond to questions and feedback?” That way, I can compare people’s self-evaluations with what the other person involved in the internship thinks.

There’s also a freeform text field for interns to give feedback on how their mentor is doing. This is important, because many Outreachy mentors are new to mentoring. They may need to have some coaching to understand how they can be more supportive to their interns. While most of the interns are doing great, I can see that I’m going to need to nudge a couple of mentor and intern pairs in the right direction.

Interviews with Alums

I did video interviews with five Outreachy interns at the Mozilla All Hands in December 2018. I loved interviewing them, because it’s great to hear their personal stories. I’ll be using the footage to create videos to promote the Outreachy program.

I’ve created short-hand transcripts of two of the videos, but haven’t gotten to the other five. Transcripts help for a couple reasons. Most importantly, I can add closed captioning to the finished videos. I also have a searchable text database for when I need to find quotes about a particular topic. Seeing the text allows me to group similar experiences and create a cohesive narrative for the promotional video.

Ramping up for May 2019 Internships

The Outreachy December 2018 to March 2019 internships are just starting, but we’re already thinking of the next round. January is typically the time we start pinging communities to see if they want to be involved in mentoring interns during the February to March application period.

That means we need to have the website ready to handle both a currently running internship cohort, and a new internship round where mentors can submit projects. There’s some technical debt in the Outreachy website code that we need to address before we can list the next round’s internship dates.

The Outreachy website is designed to guide internship applicants through the application process. It’s built with a web framework tool called Django, which is written in Python. Django makes web development easier, because you can define Python classes that represent your data. Django then uses those classes to create a representation in the database. The part of Django that translates Python into database schema is called the ORM (Object Relational Mapper).

For example, the Outreachy website keeps track of internship rounds (the RoundPage class). Each internship round has dates and other information associated with it. For example, it has the date for when the application period starts and ends, and when the internship starts and ends.

It makes sense to store internship rounds in a database, because all internship rounds have the same kinds of deadlines associated with them. You can do database queries to find particular rounds in the database. For example, the Django Python code to look up the latest round (based on when the interns start their internship) is RoundPage.objects.latest('internstarts').
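
Putting that together, a stripped-down version of the idea might look like this (a sketch; other than the RoundPage name, the internstarts field, and the latest() call from the paragraph above, the fields are illustrative):

```python
from django.db import models

class RoundPage(models.Model):
    # Simplified sketch of an internship round; the real model has many more fields.
    appsopen = models.DateField()      # application period opens (illustrative name)
    appsclose = models.DateField()     # application period closes (illustrative name)
    internstarts = models.DateField()  # internship start date
    internends = models.DateField()    # internship end date (illustrative name)

# Look up the round whose interns start the latest:
current_round = RoundPage.objects.latest('internstarts')
```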

The work I’ve recently been doing is to deal with the fact that two internship rounds can be active at once. We’re about to open the next internship round for mentors to submit new projects. On February 18, the next application period will open. But the December 2018 round of internships will still be active until March 4.

The Outreachy website’s pages have to deal with displaying data from multiple rounds. For example, on the Outreachy organizers’ dashboard page, I need to be able to send out reminder emails about final mentor feedback for the December 2018 round, while still reviewing and approving new communities to participate in the May 2019 round. Outreachy mentors need to still be able to submit feedback for their current intern in the December 2018 round, while (potentially) submitting a new project for the May 2019 round.

It’s mostly a lot of refactoring and debugging Python code. I’m writing more Django unit tests to deal with corner cases. Sometimes it’s hard to debug when something fails in the unit test, but doesn’t fail in our local deployment copy. I’m fairly new to testing in Django, and I wrote my first test recently! I feel really silly for not starting on the tests sooner, but I’m slowly catching up to things!
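
For example, one of those corner-case tests might look roughly like this (a sketch that assumes a RoundPage model like the one above; the dashboard URL and the exact assertions are hypothetical):

```python
from datetime import date
from django.test import TestCase
# from home.models import RoundPage  # import path is hypothetical

class TwoActiveRoundsTest(TestCase):
    def setUp(self):
        # A December round that is still running, and a May round that is
        # accepting new projects at the same time.
        self.december = RoundPage.objects.create(
            appsopen=date(2018, 9, 10), appsclose=date(2018, 10, 30),
            internstarts=date(2018, 12, 4), internends=date(2019, 3, 4))
        self.may = RoundPage.objects.create(
            appsopen=date(2019, 2, 18), appsclose=date(2019, 4, 2),
            internstarts=date(2019, 5, 20), internends=date(2019, 8, 20))

    def test_dashboard_shows_both_rounds(self):
        response = self.client.get('/dashboard/')  # hypothetical URL
        self.assertContains(response, 'December 2018')
        self.assertContains(response, 'May 2019')
```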

What’s Next?

January 2019 is going to be spent contacting communities about participating in the May 2019 to August 2019 round. I have some video footage of Outreachy interns I interviewed at the Tapia conference and Mozilla All Hands, and I hope to put it into a promotional video to inspire people to become mentors. It’s a fun exercise that uses some of the video editing skills I have from making fanvideos.

I’ll also be at FOSDEM in February 2019. If you’re there, find me in either the Software Freedom Conservancy booth on Saturday, or the Community devroom on Sunday. I’ll also be helping out with the Copyleft Conference on Monday.

I’ll be giving a talk at FOSDEM on changing team culture to better support people with impostor syndrome. The goal is not to ask people with impostor syndrome to change, but instead to figure out how to change our culture so that we don’t create or trigger impostor syndrome. The talk is called “Supporting FOSS Community Members with Impostor Syndrome”. The talk will be from 9:10am to 9:40am on Sunday (the first talk slot).


Update on Sentiment Analysis of FOSS communities

One of my goals with my new open source project, FOSS Heartbeat, has been to measure the overall sentiment of communication in open source communities. Are the communities welcoming and friendly, hostile, or neutral? Does the bulk of positive or negative sentiment come from core contributors or outsiders? In order to make this analysis scale across multiple open source communities with years of logs, I needed to be able to train an algorithm to recognize the sentiment or tone of technical conversation.

How can machine learning recognize human language sentiment?

One of the projects I’ve been using is the Stanford CoreNLP library, an open source Natural Language Processing (NLP) project. The Stanford CoreNLP takes a set of training sentences (manually marked so that each word and each combined phrase has a sentiment) and it trains a neural network to recognize the sentiment.

The problem with any form of artificial intelligence is that the input into the machine is always biased in some way. For the Stanford CoreNLP, their default sentiment model was trained on movie reviews. That means, for example, that the default sentiment model thinks “Christian” is a very positive word, whereas in an open source project that’s probably someone’s name. The default sentiment model also consistently marks any sentence expressing a neutral technical opinion as having a negative tone. Most people leaving movie reviews either hate or love the movie, and people are unlikely to leave a neutral review analyzing the technical merits of the special effects. Thus, it makes sense that a sentiment model trained on movie reviews would classify technical opinions as negative.

Since the Stanford CoreNLP default sentiment model doesn’t work well on technical conversation, I’ve been creating a new set of sentiment training data that only uses sentences from open source projects. That means that I have to manually modify the sentiment of words and phrases in thousands of sentences that I feed into the new sentiment model. Yikes!
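
For reference, the trainer reads sentences as phrase trees where every node carries a sentiment label from 0 (very negative) to 4 (very positive). A hand-labeled sentence in roughly that format would look something like this (the labels below are my own illustration):

```
(4 (3 (2 Thanks) (2 (2 for) (2 (2 the) (2 feedback)))) (3 !))
```

Every phrase needs its own label, which is why building even 1,200 training sentences is so much manual work.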

As of today, the Stanford CoreNLP default sentiment model has ~8,000 sentences in their training file. I currently have ~1,200 sentences. While my model isn’t as consistent as the Stanford CoreNLP, it is better at recognizing neutral and positive tone in technical sentences. If you’re interested in the technical details (e.g. specificity, recall, false positives and the like), you can take a look at the new sentiment model’s stats. This blog post will attempt to present the results without diving into supervised machine learning jargon.

Default vs New Models On Positive Tone

Let’s take a look at an example of a positive code review experience. Each sentence below was scored twice: once by the default sentiment model in Stanford CoreNLP, which was trained on movie reviews, and once by the new sentiment model I’ve been training. Both models rate each sentence on a five-point scale:

  • Very positive
  • Positive
  • Neutral
  • Negative
  • Very negative

Hey @1Niels 🙂 is there a particular reason for calling it Emoji Code?

I think the earlier guide called it emoji name.

A few examples here would help, as well as explaining that the pop-up menu shows the first five emojis whose names contain the letters typed.

(I’m sure you have a better way of explaining this than me :-).

@arpith I called them Emoji code because that’s what they’re called on Slack’s emoji guide and more commonly only other websites as well.

I think I will probably change the section name from Emoji Code to Using emoji codes and I’ll include your suggestion in the last step.

Thanks for the feedback!

Default vs New Models On Positive Tone

The default model trained on movie reviews rated 4 out of 7 sentences as negative and 1 out of 7 as positive. The movie review model tends to classify neutral technical talk as having a negative tone, including sentences like “I called them Emoji code because that’s what they’re called on Slack’s emoji guide and more commonly only other websites as well.” It did recognize the sentence “Thanks for the feedback!” as positive, which is good.

The new model trained on comments from open source projects rated 1 sentence as negative, 2 as positive, and 1 as very positive. Most of the positive tone of this example comes from the use of smiley faces, which I’ve been careful to train the new model to recognize. Additionally, I’ve been teaching it that exclamation points ending a sentence that is overall positive shift the tone to very positive. I’m pleased to see it pick up on those subtleties.

Default vs New Models On Neutral Tone

Let’s have a look at a neutral-tone code review example. Again, both models rate each sentence on the same five-point scale:

  • Very positive
  • Positive
  • Neutral
  • Negative
  • Very negative

This seems to check resolvers nested up to a fixed level, rather than checking resolvers and namespaces nested to an arbitrary depth.

I think a inline-code is more appropriate here, something like “URL namespace {} is not unique, you may not be able to reverse all URLs in this namespace”.

Errors prevent management commands from running, which is a bit severe for this case.

One of these should have an explicit instance namespace other than inline-code, otherwise the nested namespaces are not unique.

Please document the check in inline-code.

There’s a list of URL system checks at the end.

Default vs New Models On Neutral Tone

Again, the default sentiment model trained on movie reviews classifies neutral review comments as negative, marking 5 out of 6 sentences as negative.

The new model trained on open source communication is a bit mixed on this example, marking 1 sentence as positive and 1 negative, out of 6 sentences. Still, 4 out of 6 sentences were correctly marked as neutral, which is pretty good, given the new model has a training set that is 8 times smaller than the movie review set.

Default vs New Models On Negative Tone

Let’s take a look at a negative example. Please note that this is not a community that I am involved in, and I don’t know anyone from that community. I found this particular example because I searched for “code of conduct”. Note that the behavior displayed on the thread caused the initial contributor to offer to abandon their pull request. A project outsider stated they would recommend their employer not use the project because of the behavior. Another project member came along to ask for people to be more friendly. So quite a number of people thought this behavior was problematic.

Again, both models rate each sentence on the same five-point scale:

  • Very positive
  • Positive
  • Neutral
  • Negative
  • Very negative

Dude, you must be kidding everyone.

What dawned on you – that for a project to be successful and useful it needs confirmed userbase – was crystal clear to others years ago.

Your “hard working” is little comparing to what other people have been doing for years.

Get humbler, Mr. Arrogant.

If you find this project great, figure out that it is so because other people worked on it before.

Learn what they did and how.

But first learn Python, as pointed above.

Then keep working hard.

And make sure the project stays great after you applied your hands to it.

Default vs New Models On Negative Tone

The default model trained on movie reviews classifies 4 out of 9 sentences as negative and 2 as positive. The new model classifies 2 out of 9 sentences as negative and 2 as positive. In short, it needs more work.

It’s unsurprising that the new model doesn’t recognize negative sentiment very well right now, since I’ve been focusing on making sure it can recognize positive sentiment and neutral talk. The training set currently has 110 negative sentences out of 1205 sentences total. I simply need more negative examples, and they’re hard to find because many subtle personal attacks, insults, and slights don’t use curse words. If you look at the example above, there are no good search terms, aside from the word arrogant, even though the sentences are still put-downs that create an us-vs-them mentality. Despite not using slurs or curse words, many people found the thread problematic.

The best way I’ve settled on to find negative sentiment examples is to look for “communication meta words” or people talking about communication style. My current list of search terms includes words like “friendlier”, “flippant”, “abrasive”, and similar. Some search words like “aggressive” yield too many false positives, because people talk about things like “aggressive optimization”. Once I’ve found a thread that contains those words, I’ll read through it and find the comments that caused the people to ask for a different communication style. Of course, this only works for communities that want to be welcoming. For other communities, searching for the word “attitude” seems to yield useful examples.
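
Here’s a sketch of that kind of keyword triage over a file of downloaded comments (the filename and JSON layout are just assumptions for illustration):

```python
import json

# Words that tend to show up when people push back on communication style.
META_WORDS = ("friendlier", "flippant", "abrasive", "attitude")

def candidate_threads(comments_file):
    """Yield (issue number, comment body) pairs worth reading by hand.
    Assumes a JSON list of comment objects with 'issue' and 'body' keys."""
    with open(comments_file) as f:
        comments = json.load(f)
    for comment in comments:
        body = comment.get("body", "").lower()
        if any(word in body for word in META_WORDS):
            yield comment["issue"], comment["body"]

for issue, body in candidate_threads("rust-comments.json"):
    print(issue, body[:80])
```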

Still, it’s a lot of manual labor to identify problematic threads and fish out the negative sentences that are in those threads. I’ll be continuing to make progress on improving the model to recognize negative sentiment, but it would help if people could post links to negative sentiment examples on the FOSS Heartbeat github issue or drop me an email.

Visualizing Sentiment

Although the sentiment model isn’t perfect, I’ve added visualization for the sentiment of several communities on FOSS Heartbeat, including 24pullrequests, Dreamwidth, systemd, elm, fsharp, and opal.

The x-axis is the date. I used the number of neutral comments in an issue or pull request as the y-axis coordinate, with the error bars indicating the number of positive and negative comments. If an issue or pull request had twice as many negative comments as positive comments, it was marked as a negative thread. If it had twice as many positive comments as negative comments, it was marked as positive. If neither sentiment won and more than 80% of the comments were neutral, it was marked as neutral. Otherwise the issue or pull request was marked as mixed sentiment.
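
Roughly, that classification rule looks like this (a sketch; the exact thresholds and tie-breaking in the FOSS Heartbeat code may differ):

```python
def classify_thread(positive, negative, neutral):
    """Label an issue or pull request from its per-comment sentiment counts."""
    total = positive + negative + neutral
    if total == 0:
        return "neutral"
    if negative >= 2 * positive and negative > 0:
        return "negative"
    if positive >= 2 * negative and positive > 0:
        return "positive"
    if neutral / total >= 0.8:
        return "neutral"
    return "mixed"

# The opal code of conduct thread discussed below:
print(classify_thread(positive=197, negative=441, neutral=1207))  # "negative"
```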

Here’s an example:

[Graph: 24pullrequests-sentiment]

The sentiment graph is from the 24pullrequests repository. It’s a Ruby website that encourages programmers to gift code to open source projects during the 24 days in December before Christmas. One of the open source projects you can contribute to is the 24 pull requests site itself (isn’t that meta!). During the year, you’ll see the site admins filing help-wanted enhancements to update the software that runs the website or tweak a small feature. They’re usually closed within a day without a whole lot of back and forth between the main contributors. The mid-year contributions show up as the neutral, low-comment dots throughout the year. When the 24 pull request site admins do receive a gift of code to the website by a new contributor as part of the 24 pull requests period, they’re quite thankful, which you can see reflected in the many positive comments around December and January.

Another interesting example to look at is negative sentiment in the opal community:

[Graph: opal-negative-sentiment]

That large spike with 1207 neutral comments, 197 positive comments, and 441 negative comments is the opal community issue to add a code of conduct. Being able to quickly see which threads are turning into flamewars would be helpful to community managers and maintainers who have been ignoring the issue tracker to get some coding done. Once the sentiment model is better trained, I would love to analyze whether communities become more positive or more neutral after a Code of Conduct is put in place. Tying that data to whether more or fewer newcomers participate after a Code of Conduct is in place may be interesting as well.

There are a lot of real-world problems that sentiment analysis, participation data, and a bit of psychology could help us identify. One common social problem is burnout, which is characterized by an increased workload (stages 1 & 2), working at odd hours (stage 3), and an increase in negative sentiment (stage 6). We have participation data, comment timestamps, and sentiment for those comments, so we would only need some examples of burnout to identify the pattern. By being aware of the burnout stages of our collaborators, we could intervene early to help them avoid a spiral into depression.

A more corporate-focused interest might be to identify issues where key customers express frustration and anger, and focus developers on fixing the squeaky wheel. If FOSS Heartbeat were extended to analyze comments on mailing lists, Slack, Discourse, or Mattermost, companies could get a general idea of customer sentiment after a new software release. Companies can also use participation data and data about who is merging code to figure out which projects or parts of their code are not being well maintained, and assign additional help, as the exercism community did.

Another topic of interest to communities hoping to grow their developer base would be identifying the key factors that cause newcomers to become more active contributors to a project. Is it a positive welcome? A mentor suggesting a newcomer tackle a medium-sized issue by tagging them? Does adding documentation about a particularly confusing area cause more newcomers to submit pull requests to that area of code? Does code review from a particularly friendly person cause newcomers to want to come back? Or maybe code review lag causes them to drop off?

These are the kinds of people-centric community questions I would love to answer by using FOSS Heartbeat. I would like to thank Mozilla for sponsoring the project for the last three months. If you have additional questions you’d love to see FOSS Heartbeat answer, I’m available for contract work through Otter Tech. If you’re thankful about the work I’ve put in so far, you can support me through my patreon.

What open source community question would you like to see FOSS Heartbeat tackle? Feel free to leave a comment.

Impact of bots on github communities

I’ve been digging into contributor statistics for various communities on github as part of my work on FOSS Heartbeat, a project to measure the health of open source communities.

It’s fascinating to see bots show up in the contributor statistics. For example, if you look at github users who comment on issues in the Rust community, you’ll quickly notice two contributors who interact a lot:

[Graph: rust-bots]

bors is a bot that runs pull requests through the Rust continuous integration test suite, and automatically merges the code into the master branch if it passes. bors responds to commands issued in pull request comments (of the form ‘@bors r+ [commit ID]’) by community members with permission to merge code into rust-lang/rust.

rust-highfive is a bot that recommends a reviewer based on the contents of the pull request. It then adds a comment that tags the reviewer, who will get a github notification (and possibly an email, if they have that set up).

Both bots have been set up by the Rust community in order to make pull request review smoother. bors is designed to cut down the amount of time developers need to spend running the test suite on code that’s ready to be merged. rust-highfive is designed to make sure the right person is aware of pull requests that may need their experienced eye.

But just how effective are these github bots? Are they really helping the Rust community or are they just causing more noise?

Chances of a successful pull request

bors merged its first pull request on 2013-02-02. The year before bors was introduced, only 330 out of 503 pull requests were merged. The year after, 1574 out of 2311 pull requests were merged. So the Rust community had more than four times as many pull requests to review.

Assuming that the tests bors used were some of the same tests rust developers were running manually, we would expect that pull requests would be rejected at about the same rate (or maybe rejected more, since the automatic CI system would catch more bugs).

To test that assumption, we turn to a statistical method called the chi-squared test. It helps answer the question, “Is there a difference in the success rates of two samples?” In our case, it helps us answer the question, “After bors was used, did the percentage of accepted pull requests change?”
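
With the merge counts from above, that check is a few lines of SciPy (a sketch; the original analysis may have used a different tool):

```python
from scipy.stats import chi2_contingency

# Merged vs. not-merged pull requests, the year before and after bors.
before_bors = [330, 503 - 330]
after_bors = [1574, 2311 - 1574]

chi2, p_value, dof, expected = chi2_contingency([before_bors, after_bors])
print("chi2 = {:.2f}, p = {:.3f}".format(chi2, p_value))
# If p stays above 0.01, we can't claim (at 99% confidence) that the merge
# rate changed after bors was introduced.
```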

[Graph: rust-bors-merged]

It looks like there’s no statistical difference in the chances of getting a random pull request merged before or after bors started participating. That’s pretty good, considering the number of pull requests submitted quadrupled.

Now, what about rust-highfive? Since the bot is supposed to recommend pull request reviewers, we would hope that pull requests would have a higher chance of getting accepted. Let’s look at the chances of getting a pull request merged for the year before and the year after rust-highfive was introduced (2014-09-18).

[Graph: rust-highfive-merged]

So yes, it does seem like rust-highfive is effective at getting the right developer to notice a pull request they need to review and merge.

Impact on time a pull request is open

One of the hopes of a programmer who designs a bot is that it will cut down on the amount of time that the developer has to spend on simple repetitive tasks. A bot like bors is designed to run the CI suite automatically, leaving the developer more time to do other things, like review other pull requests. Maybe that means pull requests get merged faster?

To test the impact of bors on the amount of time a pull request is open, we turn to the Two-means hypothesis test. It tells you whether there’s a statistical difference between the means of two different data sets. In our case, we compare the length of time a pull request is open. The two populations are the pull requests a year before and a year after bors was introduced.
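
The same comparison is easy to sketch with SciPy, given the number of days each pull request stayed open in each year (the numbers below are made up; Welch’s t-test is one common way to run the two-means comparison):

```python
from scipy.stats import ttest_ind

# Days each pull request stayed open (illustrative numbers only).
open_days_before_bors = [0.5, 2.1, 3.0, 7.5, 1.2, 4.4, 2.9]
open_days_after_bors = [1.0, 3.5, 2.8, 9.0, 2.2, 6.1, 4.7]

# Welch's t-test: is the difference in mean open time statistically significant?
t_stat, p_value = ttest_ind(open_days_before_bors, open_days_after_bors,
                            equal_var=False)
print("t = {:.2f}, p = {:.3f}".format(t_stat, p_value))
```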

[Graph: rust-bors-pr-open]

We would hope to see the average open time of a pull request go down after bors was introduced, but that’s not what the data shows. The graph shows the average open time actually increased, by 1.1 days.

What about rust-highfive? We would hope that a bot that recommends a reviewer would cause pull requests to get closed sooner.

[Graph: rust-bors-pr-open]

The graph shows there’s no statistical evidence that rust-highfive made a difference in the length of time pull requests were open.

These results seemed odd to me, so I did a little bit of digging to generate a graph of the average time a pull request is open for each month:

[Graph: rust-pr-open-trend]

The length of time pull requests are open has been increasing for most of the Rust project history. That explains why comparing pull request age before and after bors showed an increase in the wait time to get a pull request merged. The second line shows the point that rust-highfive was introduced, and we do see a decline in the wait time. Since the decrease is almost symmetrical with the increase the year before, the average was the same for the two years.

Summary

What can we conclude about github bots from all this statistics?

At the 99% confidence level, we found no statistically significant change in the chances of a random pull request getting merged after the bors bot started automatically merging changes that passed the CI tests.

At the 99% confidence level, rust-highfive increased a Rust developer’s chances of getting code merged, by as much as 11.7%. The bot initially helped lower the amount of time developers had to wait for their pull requests to be merged, but something else changed in May 2015 that caused the wait time to increase again. I’ll note that Rust version 1.0 came out in May 2015. Rust developers may have been more cautious about accepting pull requests after the API was frozen, or the volume of pull requests may have increased. It’s unclear without further study.

This is awesome, can I help?

If you’re interested in metrics analysis for your community, please leave a note in the comments or drop an email to my consulting business, Otter Tech. I could use some help identifying the github usernames for bots in other communities I’m studying:

This blog post is part of a series on open source community metrics analysis:

Part 1: Measuring the Impact of Negative Language on FOSS Participation

You can find the open source FOSS Heartbeat code and FOSS community metrics on github. Thank you to Mozilla, who is sponsoring this research!

Measuring the Impact of Negative Language on FOSS Participation (Part I)

A recent academic paper showed that there were clear differences in the communication styles of two of the top Linux kernel developers (“Differentiating Communication Styles of Leaders on the Linux Kernel Mailing List”). One leader is much more likely to say “thank you” while the other is more likely to jump into a conversation with a “well, actually”.

Many open source contributors have stories of their patches being harshly rejected. Some people are able to “toughen up” and continue participating, and others will move on to a different project. The question is, how many people end up leaving a project due to harsh language? Are people who experience positive language more likely to contribute more to a project? Just how positive do core open source contributors need to be in order to attract newcomers and grow their community? Which community members are good at mentoring newcomers and helping them step into leadership roles?

I’ve been having a whole lot of fun coming up with scientific research methods to answer these questions, and I’d like to thank Mozilla for funding that research through their Participation Experiment program.

How do you measure positive and negative language?

The Natural Language Processing (NLP) field tries to teach computers to parse and derive meaning from human language. When you ask your phone a question like, “How old was Ada Lovelace when she died?” somewhere a server has to run a speech-to-text algorithm. NLP allows that server to parse the text into a subject “Ada Lovelace” and other sentence parts, which allows the server to respond with the correct answer, “Ada Lovelace died at the age of 36”.

Several open source NLP libraries, including the Natural Language Toolkit (NLTK) and Stanford CoreNLP, also include sentiment analysis. Sentiment analysis attempts to determine the “tone” and objectiveness of a piece of text. I’ll do more of a deep dive into sentiment analysis next month in part II of this blog post. For now, let’s talk about a more pressing question.

How do you define open source participation?

On the surface, this question seems so simple. If you look at any github project page or Linux Foundation kernel report or OpenStack statistics, you’ll see a multitude of graphs analyzing code contribution statistics. How many lines of code do people contribute? How frequently? Did we have new developers contribute this year? Which companies had the most contributions?

You’ll notice a particular emphasis here, a bias if you will. All these measurements are about how much code an individual contributor got merged into a code base. However, open source developers don’t act alone to create a project. They are part of a larger system of contributors that work together.

In order for code or documentation to be merged, it has to be reviewed. In open source, we encourage peer review in order to make sure the code is maintainable and (mostly) free of bugs. Some reports measure the work maintainers do, but they often lack recognition for the efforts of code reviewers. Bug reports are seen as bad, rather than proof that the project is being used and its features are being tested. People may measure the number of closed vs open bug reports, but very few measure and acknowledge the people who submit issues, gather information, and test fixes. Open source projects would be constantly crashing without the contribution of bug reporters.

All of these roles (reviewer, bug reporter, debugger, maintainer) are valuable ways to contribute to open source, but no one measures them because the bias in open source is towards developers. We talk even less about the vital non-coding contributions people make (conference planning, answering questions, fundraising, etc). Those are invaluable but harder to measure and attribute.

For this experiment, I hope to measure some of the less talked-about ways to contribute. I would love to extend this work to the many different contribution methods and different tools that open source communities use to collaborate. However, it’s important to start small, and develop a good framework for testing hypotheses like my hypothesis about negative language impacting open source participation.

How do you measure open source participation?

For this experiment, I’m focusing on open source communities on github. Why? The data is easier to gather than projects that take contributions over mailing lists, because the discussion around a contribution is all in one place, and it’s easy to attribute replies to the right people. Plus, there are a lot of libraries in different languages that provide github API wrappers. I chose to work with the github3.py library because it still looked to be active and it had good documentation.

Of course, gathering all the information from github isn’t easy when you want to do sentiment analysis over every single community interaction. When you do, you’ll quickly run into their API request rate limit of 5,000 requests per hour. There are two projects that archive the “public firehose” of all github events: http://githubarchive.org and http://ghtorrent.org. However, those projects only archive events that happened after 2011 or 2012, and some of the open source communities I want to study are older than that. Plus, downloading and filtering through several terabytes of data would probably take just as long as slurping just the data I need through a smaller straw (and would allow me to avoid awkward conversations with my ISP).
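
I used github3.py for the real data pull, but the shape of the problem is easy to see with plain REST calls (a sketch using the requests library rather than github3.py; the token and repository are placeholders, and a token is what raises the limit to 5,000 requests per hour):

```python
import requests

TOKEN = "YOUR_TOKEN_HERE"  # placeholder personal access token
HEADERS = {"Authorization": "token " + TOKEN}

def fetch_all(url, params=None):
    """Follow GitHub's paginated API and collect every result."""
    results = []
    while url:
        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()
        results.extend(response.json())
        print("requests left this hour:", response.headers["X-RateLimit-Remaining"])
        url = response.links.get("next", {}).get("url")  # pagination
        params = None  # the 'next' link already carries the query string
    return results

issues = fetch_all("https://api.github.com/repos/rust-lang/rust/issues",
                   params={"state": "all", "per_page": 100})
```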

For my analysis, I wanted to pull down all open and closed issues and pull requests, along with their comments. For a community like Rust, which has been around since 2010, their data (as of a week or two ago) looks like this:

  • 18,739 issues
  • 18,464 pull requests
  • 182,368 comments on issues and pull requests
  • 31,110 code review comments

Because of some oddities with the github API (did you know that issue JSON data can describe either an issue or a pull request?), it took about 20 hours to pull down the information I need.
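
That oddity is easy to handle once you know the trick: issue JSON that actually describes a pull request carries an extra pull_request key. Continuing the sketch above:

```python
def is_pull_request(issue_json):
    """GitHub's issues endpoint returns pull requests too; PR entries
    include an extra 'pull_request' key in their JSON."""
    return "pull_request" in issue_json

pull_requests = [i for i in issues if is_pull_request(i)]
plain_issues = [i for i in issues if not is_pull_request(i)]
```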

I’m still sorting through how exactly I want to graph the data and measure participation over time. I hope to have more to share in a week!

*Edit* The code is available on github, and the reports for various open source communities are also available.

“I was only joking”

There was a very interesting set of tweets yesterday that dissected the social implications of saying, “I was only joking.” To paraphrase:

I’ve been mulling on the application of this analysis of humor with respect to the infamous “Donglegate” incident. Many men in tech responded with anger and fear over a conference attendee getting fired over a sexist joke. “It was only a joke!” they cried.

However, the justification falls flat if we assume that you’re never “just joking” and that jokes define in-groups and out-groups. The sexist joke shared between two white males (who were part of the dominant culture of conferences in 2013) defined them as part of the “in-group” and pushed the African American woman who overheard the “joke” into the “out-group”.

When the woman pushed back against the joke by tweeting about it with a picture of the joker, the people who were part of the in-group who found that joke “funny” were angry. When the joker was fired, it was a sign that they were no longer the favored, dominant group. Fear of loss of social status is a powerful motivator, which is what caused people from the joke’s “in-group” to call for the woman to be fired as well.

Of course, it wasn’t all men who blasted the woman for reacting to a “joke”. There were many women who blasted the reporter for “public shaming”, or who thought the woman was being “too sensitive”, or rushed to reassure men that they had never experienced sexist jokes at conferences. Which brings us to the topic of “chill girls”:

The need for women to fit into a male-dominated tech world means that “chill girls” have to laugh at sexist jokes in order to be part of the “in-group”. To not laugh, or to call out the joker, would be to resign themselves to the “out-group”.

Humans have a fierce need to be socially accepted, and defining in-groups and out-groups is one way to secure that acceptance. This is exemplified in many people’s push back against what they see as too much “political correctness”.

For example, try getting your friends to stop using casually ableist terms like “lame”, “retarded”, “dumb”, or “stupid”. Bonus points if you can get them to remove classist terms like “ghetto” or homophobic statements like “that’s so gay”. What you’ll face are nonsense arguments like, “It’s just a word.” People who call out these terms are berated and no longer “cool”. Unconsciously or consciously, the person will try to preserve the in-groups and out-groups, and their own power from being a part of the in-group.

Stop laughing awkwardly. Your silence is only lending power to oppression. Start calling out people for alienating jokes. Stop preserving the hierarchy of classism, ableism, homophobia, transphobia, and sexism.

White Corporate Feminism

When I first went to the Grace Hopper Celebration of Women in Computing conference, it was magical. Being a woman in tech means I’m often the only woman in a team of male engineers, and if there’s more than one woman on a team, it’s usually because we have a project manager or marketing person who is a woman.

Going to the Grace Hopper conference, and being surrounded by women engineers and women computer science students, allowed me to relax in a way that I could never do in a male-centric space. I could talk with other women who just understood things like the glass ceiling and having to be constantly on guard in order to “fit in” with male colleagues. I had crafted a persona, an armor of collared shirts and jeans, trained myself to interrupt in order to make my male colleagues listen, and lied to myself and others that I wasn’t interested in “girly” hobbies like sewing or knitting. At Grace Hopper, surrounded by women, I could stop pretending, and try to figure out how to just be myself. To take a breath, stop interrupting, and cherish the fact that I was listened to and given space to listen to others.

However, after a day or so, I began to feel uneasy about two particular aspects of the Grace Hopper conference. I felt uneasy watching how aggressively the corporate representatives at the booths tried to persuade the students to join their companies. You couldn’t walk into the ballroom for keynotes without going through a gauntlet of recruiters. When I looked around the ballroom at the faces of the women surrounding me, I realized the second thing that made me uneasy. Even though Grace Hopper was hosted in Atlanta that year, a city that is 56% African American, there weren’t that many women of color attending. We’ve also seen the Grace Hopper conference feature more male keynote speakers, which is problematic when the goal of the conference is to allow women to connect to role models that look like them.

When I did a bit of research for this blog post, I looked at the board member list for the Anita Borg Institute, which organizes the Grace Hopper Conference. I was unsurprised to see major corporate executives hold the majority of Anita Borg Institute board seats. However, I was curious why the board member page had no pictures on it. I used Google Image search in combination with each board member’s name and company to create this image:
[Image: anita-borg-board]

My unease was recently echoed by Cate Huston, who also noticed the trend towards corporations trying to co-opt women’s only spaces to feed women into their toxic hiring pipeline. Last week, I also found this excellent article on white feminism, and how white women need to allow people of color to speak up about the problematic aspects of women-only spaces. There was also an interesting article last week about how “women’s only spaces” can be problematic for trans women to navigate if they don’t “pass” the white-centric standard of female beauty. The article also discusses that by promoting women-only spaces as “safe”, we are unintentionally promoting the assumption that women can’t be predators, unconsciously sending the message to victims of violent or abusive women that they should remain silent about their abuse.

So how do we expand women-only spaces to be more inclusive, and move beyond white corporate feminism? It starts with recognizing that the problem often lies with the white women who start initiatives and fail to bring in partners who are people of color. We also need to find ways to fund inclusive spaces and diversity efforts without big corporate backers.

We also need to take a critical look at how well-meaning diversity efforts often center around improving tech for white women. When you hear a white male say, “We need more women in X community,” take a moment to question them on why they’re so focused on women and not also bringing in more people of color, people who are differently abled, or LGBTQ people. We need to figure out how to expand the conversation beyond white women in tech, both in external conversations, and in our own projects.

One of the projects I volunteer for is Outreachy, a three-month paid internship program to increase diversity in open source. In 2011, the coordinators were told the language around encouraging “only women” to apply wasn’t trans-inclusive, so they changed the application requirements to clarify the program was open to both cis and trans women. In 2013, they clarified that Outreachy was also open to trans men and gender queer people. Last year, we wanted to open the program to men who were traditionally underrepresented in tech. After taking a long hard look at the statistics, we expanded the program to include all people in the U.S. who are Black/African American, Hispanic/Latin@, American Indian, Alaska Native, Native Hawaiian, or Pacific Islander. We want to expand the program to additional people who are underrepresented in tech in other countries, so please contact us if you have good sources of diversity data for your country.

But most importantly, white people need to learn to listen to people of color instead of being a “white savior”. We need to believe people of color’s lived experience, amplify their voices when people of color tell us they feel isolated in tech, and stop insisting “not all white women” when people of color critique a problematic aspect of the feminist community.

Trying to move into more intersectional feminism is one of my goals, which is why I’m really excited to speak at the Richard Tapia Celebration of Diversity in Computing. I hadn’t heard of it until about a year ago (probably because they have less corporate sponsorship and less marketing), but it’s been described to me as “Grace Hopper for people of color”. I’m excited to talk to people about open source and Outreachy, but most importantly, I want to go and listen to people who have lived experiences that are different from mine, so I can promote their voices.

If you can kick in a couple dollars a month to help me cover costs for the conference, please donate on my Patreon. I’ll be writing about the people I meet at Tapia on my blog, so look for a follow-up post in late September!

Code of Conduct Warning Signs

I’ve got something on my chest that needs to be expressed. It’s likely to be a bit ranty, because I’ve got some scars around dealing with this issue. I want to talk about Codes of Conduct (CoCs).

No Trespassing!

Over the last five years, I’ve watched the uptick in adoption of CoCs at open source conferences. I’ve watched conferences try to adopt a CoC and fall flat on their face because they completely misunderstood the needs of minorities at their conferences. In recent years, I’ve watched open source communities start to adopt CoCs. For some communities, a CoC is an afterthought, a by-product of community leadership stepping up in many different ways to increase diversity in open source.

However, a worrisome trend is happening: I see communities starting to adopt Codes of Conduct without thinking through their implications. A CoC has become a diversity checkmark.

Why is this? Perhaps it’s because stories of harassment have become widespread. People look at the abuse that G4mer Goobers have thrown at women developers, especially women of color and trans women, and they say, “I don’t want those types of people in my community.” For them, a Code of Conduct has become a “No Trespassing” sign for external harassers.

In general, that’s fine. It’s good to stand up to harassers and say, “That’s not acceptable.” People hope that adding a Code of Conduct is like showing garlic to a vampire: they’ll hiss and run off into the darkness.

Pot, meet Kettle

However, a lot of people who are gung-ho about banning anonymous online harassers are often reluctant to clean their own house. They make excuses for the long-standing harassers in their community, and they have no idea how they would even enforce a CoC against someone who is an entrenched member of the community. Someone who organizes conferences. Someone who is a prolific reviewer. Someone who is your friend, your colleague, your drinking buddy.

You see, no one wants to admit that they are “that person”. It’s hard to accept that everyone, including your friends, is unconsciously biased. It’s even harder to admit that your friends are slightly racist/homophobic/transphobic/etc. No one wants to recognize the ableist language they use in their everyday life, like “lame”, “dumb”, or “retarded”. It’s tough to admit that your conference speakers are mostly cis white men because you have failed to network with minorities. It’s difficult to come to grips with the fact that your leadership is toxic. It’s embarrassing to admit that you may be so privileged and so lacking in understanding of minorities’ lived experiences that you may need to reach outside your network to find people to help you deal with Code of Conduct incidents.

Code of Conduct Enforcement

And you will have incidents. People will report Code of Conduct violations. The important question is, how will you handle those incidents and enforce your CoC? You’ve put a “No Trespassing” sign up, but are you willing to escort people out of your community? Take their commit access away? Ask them to take a break from the mailing list? If you don’t decide up front how you’re going to enforce your Code of Conduct, you’re going to apply it unfairly. You’ll give your buddy a break, make excuses like, “But I know they’ve been working on that,” or, “Oh, yeah, that’s just so-and-so, they don’t mean that!”

You need to decide how you’ll enforce a Code of Conduct, and find diverse leadership to help you evaluate CoC violations. And for the love of $deity, if the minorities and louder allies on your enforcement committee say something is a problem, believe them!

Let’s fork it!

Another worrisome trend I see is that the people working on creating Codes of Conduct are not talking to each other. There is so much experience among open source community leaders in enforcing Codes of Conduct, but it’s become a bike-shed issue. Communities without experience in CoC enforcement are saying, “I’ll cherry-pick this clause from this CoC, and we’ll drop that clause because it doesn’t make sense for our community.”

We don’t write legal agreements without expert help. We don’t write our own open source licenses. We don’t roll our own cryptography without expert advice. We shouldn’t roll our own Code of Conduct.

Why? Because if we roll our own Code of Conduct without expert help, it creates a false sense of security. Minorities who rely on a Code of Conduct to grant them safety in an open source community will get hurt. If leadership implements a Code of Conduct as a diversity checkmark, it papers over the real problem of a community that is unwilling to put energy into being inclusive.

Diversity Check Mark Complete!

I also see smaller communities scrambling to get something, anything, in place to express that they’re a safe community. So they take a standard Code of Conduct and slap it into place, without modifying it to express their community’s needs. They don’t think about what behaviors they want to encourage in order to make their community a safe place to learn, create, and grow. They don’t think about how they could attract and retain diverse contributors (hint: I recently talked about some ideas on that front). They don’t think about the steps that they as leaders need to take in order to expand their understanding of minorities’ lived experiences, so that they can create a more inclusive community. They don’t think about the positive behaviors they want to see in their community members.

When I see an unmodified version of a Code of Conduct template in a community, I know the leadership has put up the “No Trespassing” sign to stop external harassers from coming in. But that doesn’t mean the community is inclusive or diverse. It could be a walled garden, with barriers to entry so high that only white men with unlimited amounts of spare time and a network of resources to help them can get inside. It could be a barbed-wire fence community with known harassers lurking inside. Or it could be a community that simply found another CoC was good enough for them. I can’t tell the difference.

Ask for Expert Advice

My takeaway here is that implementing a Code of Conduct is a hard, long process of cultural change that requires buy-in from the leadership in your community. Instead of having an all-out bike-shed thread on implementing a CoC, where people cherry-pick legal language without understanding the implications of removing that language, go talk with an expert. Safety First PDX, Ashe Dryden, and Frame Shift Consulting are happy to provide consulting, for a fee. If you don’t have money to pay them (and you should pay women for the emotional labor they do to create welcoming communities!), then you’ll need to spend a bunch of time educating yourself.

Read *everything* that Safety First PDX has to say about Code of Conduct design and enforcement. Read the HOW-TO design a Code of Conduct post on the Ada Initiative website. Watch Audrey Eschright talk about Code of Conduct enforcement. Look at the community code of conduct list on the Geek Feminism wiki. These are all long reads, but these are known experts in the field who are offering their expertise to keep our open source communities safe.

In Conclusion

Don’t roll your own Code of Conduct without expert advice. You wouldn’t roll your own cryptography. At the same time, don’t make a Code of Conduct into a checkmark.