
Computers and Crowds: Unexpected Authors and Their Impact on Scholarly Research

On Friday, May 17, nearly 50 librarians from CUNY and other New York City libraries gathered at the CUNY Graduate School of Journalism to participate in a program about new models for content production. This exciting program was jointly organized by the LACUNY Emerging Technologies Committee, the LACUNY Scholarly Communications Roundtable, LILAC, and the Office of Library Services.

The morning began with a lively presentation from Kate Peterson, Information Literacy Librarian at the University of Minnesota-Twin Cities, and Paul Zenke, DesignLab/Digital Humanities Initiative Project Assistant at the University of Wisconsin-Madison. In their presentation, titled “Hats, Farms, and Bubbles: How Emerging Marketing & Content Production Models are Making Research More Difficult (And What You and Your Students Can Do About It),” Kate and Paul discussed five initiatives that currently affect content creation and propagation on the internet: search engine optimization (SEO), filter bubbles, content farms, algorithm-generated content, and crowdsourcing (see their slides from this talk in the Program Materials section below).

The session began with an active poll: attendees were asked to walk to labeled parts of the room to show their familiarity with each of the five concepts. The activity revealed a wide range of prior knowledge in the room and made it clear that everyone had something to learn from the presentation.

The first topic discussed was SEO: techniques used to increase the visibility of a website to search engines. Paul noted that while all website owners want their sites to be found, practitioners of “black hat” SEO typically use content spam (hiding or manipulating text) or link spam (artificially inflating the number of links to a website) to try to trick search engines into ranking their sites highly. Some search engines have tried to mitigate the effects of SEO: in 2012 Google launched its Penguin update, which penalizes websites that violate its webmaster guidelines.
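To make the idea of a guideline penalty concrete, here is a minimal, purely illustrative sketch of how a ranking system might down-weight keyword-stuffed “content spam.” The function names, threshold, and scoring are invented for this example and have nothing to do with Google’s actual algorithms.

```python
# Toy illustration only: a crude keyword-density check of the sort a search
# engine *might* use as one signal among many to penalize content spam.
# The threshold and the scoring formula are invented for this sketch.

def keyword_density(text: str, keyword: str) -> float:
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

def spam_penalty(text: str, keyword: str, threshold: float = 0.1) -> float:
    """Return a multiplier between 0 and 1; 1.0 means no penalty."""
    density = keyword_density(text, keyword)
    return 1.0 if density <= threshold else threshold / density

page = "cheap shoes cheap shoes buy cheap shoes today cheap shoes online"
print(spam_penalty(page, "cheap"))  # density far above threshold, so the score is cut
```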

Next, Kate explained the concept of a filter bubble, a term describing how two identical searches performed on two different computers can return different results (remember those Google ads that highlighted personalized search for a beetle – the bug vs. the car?). The term filter bubble was coined by Eli Pariser in his book of the same name; we watched a brief clip of Pariser’s TED talk in which he explained the dangers of filter bubbles. When search algorithms increasingly tailor results to our interests – which they equate with whatever content we click on while browsing – we aren’t seeing the full range of information available on the internet. Facebook uses similar techniques to display content based on our friends’ interests. By creating these filter bubbles, internet corporations restrict our opportunities to encounter information that is new or challenging to us, or that presents a point of view different from our own.
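The mechanism behind a filter bubble can be sketched very simply. The code below is a hypothetical re-ranking step, not any real search engine’s algorithm: results on topics the user has clicked before get a boost, so two people typing the identical query see different orderings. The data, scores, and boost value are all invented for the example.

```python
# Hypothetical personalization sketch: results on topics the user has clicked
# before are boosted, so identical queries return different orderings.
# Data, scores, and the boost value are invented for this example.
from collections import Counter

def personalize(results, click_history, boost=0.5):
    """results: list of (title, topic, base_score); click_history: list of topics."""
    clicks = Counter(click_history)
    rescored = [(title, base + boost * clicks[topic]) for title, topic, base in results]
    return sorted(rescored, key=lambda r: r[1], reverse=True)

results = [("VW Beetle review", "cars", 1.0), ("Beetle life cycle", "insects", 1.0)]
print(personalize(results, ["cars", "cars"]))  # the car fan sees the car first
print(personalize(results, ["insects"]))       # the entomologist sees the bug first
```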

Most academic librarians are familiar with content farms: websites such as About.com and eHow.com that pay freelancers very low wages to write large volumes of low-quality articles. Often the article topics are drawn from algorithmic analysis of search data, which suggests the titles and keywords that are most profitable for advertisers – unlike journalism, this model of content creation starts with consumer demand. Paul noted that Google has also rolled out a strategy to attempt to stem the tide of low-quality content from content farm websites; in 2011 it debuted the Panda update, which downgraded 11% of the content it indexed that year. While it’s useful to us, as librarians, when Google addresses the content farm problem, it’s also somewhat troubling to realize that Google is developing algorithms for evaluating information sources.
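That “demand first” workflow can be sketched as a simple ranking of candidate article topics by estimated advertising value. The topics, numbers, and scoring formula below are invented for illustration; they are not drawn from any actual content farm.

```python
# Illustrative sketch of content-farm topic selection: rank candidate topics
# by (monthly searches x cost per click), a rough proxy for ad revenue.
# All numbers here are made up for the example.

candidate_topics = [
    {"title": "How to unclog a drain",       "monthly_searches": 40000, "cpc": 1.20},
    {"title": "History of the semicolon",    "monthly_searches": 900,   "cpc": 0.10},
    {"title": "Best credit cards for teens", "monthly_searches": 15000, "cpc": 4.50},
]

def expected_ad_value(topic):
    return topic["monthly_searches"] * topic["cpc"]

for topic in sorted(candidate_topics, key=expected_ad_value, reverse=True):
    print(f'{topic["title"]}: ${expected_ad_value(topic):,.0f} potential ad value')
```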

Perhaps one of the most surprising topics discussed was content created by machine: algorithm-generated content. Algorithms are already being used to synthesize large data sets into an accessible narrative. They are popular in areas like sports writing and business news, where there is an emphasis on statistics and on identifying trends or patterns, but they are also being used to write content such as restaurant reviews or haikus. These algorithms can even be programmed to generate a particular tone within an article, or to produce different types of articles from the same data for different situations. In academic settings, they have also been used to give students feedback on their preparation for tests like the SAT or ACT. One point of discussion during the event was the labor implications: the incentive to use algorithms to create content eliminates the need to pay anyone at all (with content farms an author is at least paid a small amount; here there is essentially no author). A question from the audience brought up the dying art of fact-checking in journalism today. Kate pointed out, interestingly, that although these articles are not written by a person, they need very little fact-checking, since they rely so heavily upon the direct import of factual data.
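A minimal data-to-text sketch shows the basic mechanism: structured data is poured into sentence templates, and a “tone” parameter swaps templates to change the voice of the story. Commercial systems for sports recaps or earnings reports are far more sophisticated; the templates and game data below are invented purely for illustration.

```python
# Toy data-to-text generator: the same game data yields different prose
# depending on the requested tone. Templates and data are invented examples.

TEMPLATES = {
    "neutral":  "{winner} defeated {loser} {w_score}-{l_score} on {date}.",
    "hometown": "{winner} cruised past {loser} with a convincing {w_score}-{l_score} win on {date}!",
    "downbeat": "{loser} fell to {winner}, {l_score}-{w_score}, in a tough outing on {date}.",
}

def write_recap(game: dict, tone: str = "neutral") -> str:
    return TEMPLATES[tone].format(**game)

game = {"winner": "Brooklyn", "loser": "Queens", "w_score": 78, "l_score": 64, "date": "May 17"}
for tone in TEMPLATES:
    print(write_recap(game, tone))
```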

Crowdsourcing was also discussed as an emerging way content is created or supported through the work of the masses. Paul briefly discussed content created through crowdsourcing on websites like Carnegie Mellon’s Eterna and Foldit, games in which contributors help solve RNA and protein folding puzzles. He also focused on crowdsourcing for fundraising on websites like Kickstarter and Indiegogo, which raises questions about what people decide to fund and not to fund. What does this mean, especially in these times of federal austerity?

During a subsequent breakout session, the crowdsourcing topic was explored further. Examples of user-supplied content included Wikipedia; Wikipedia edit-a-thons, such as the one held at NYPL to increase access to its theater and performing arts collection, were noted. MOOCs also became part of the discussion, with examples of student solutions to problems being integrated into courses as illustrations. Readersourcing.org, though never launched, was an attempt to crowdsource peer review. There was also an extended discussion of crowdsourcing as a news-gathering technique: Twitter has surfaced as a way to gather information about events as they happen, and a concern for librarians, always interested in the accuracy of information, is whether information gathered through Twitter can be trusted. Daren C. Brabham’s research on the ethics of using the crowd as a source of free labor was discussed as well. According to Brabham, a myth is perpetuated about the amateur nature of crowd contributors, when in reality many contributors, from citizen scientists to citizen graphic designers, are professionals who deserve compensation.

Kate and Paul ended by suggesting strategies that we can use to mitigate potentially negative effects of new content production models, both for us — as librarians and as internet users — and for the students and faculty with whom we work. And indeed, academic librarians are well placed to implement these recommendations as we work to educate ourselves and our patrons. We must continue to teach students to evaluate their sources, and perhaps expand to evaluating the possible filters they experience as well. Looking for opportunities to create more chance encounters with information could help burst those bubbles. Many of us already clear our web browser history and cookies regularly; can we also demand more transparency from our vendors about the information they collect from users? Finally, Kate and Paul challenged us to think about ways that we can put students into the role of creator — rather than simply consumer — to raise their awareness about these issues surrounding content production and increase their data literacy and information literacy.

After Kate and Paul’s presentation, participants broke up into three discussion groups: content farms (led by Paul), algorithms (led by Kate) and crowdsourcing (led by Prof. Beth Evans). Participants explored the implications of each of these topics for work in the library, and also discussed other issues surrounding research and the internet.

Awareness of all of these issues might help ensure that librarians and researchers (and the students we teach at the reference desk and in the classroom) don’t get stuck in the filter bubble, surrounded by thin information that was written by bots!

— by Maura Smale (City Tech), Alycia Sellie (Brooklyn College), and Beth Evans (Brooklyn College)

Program Materials:

Hats, Farms, and Bubbles slides

Videos shown during the presentation:

Epipheo. (2013). Why the News Isn’t Really the News. YouTube. http://www.youtube.com/watch?v=YoZNJsp3Kik

ExLibrisLtd. (2011). Primo ScholarRank plain and simple. YouTube. http://www.youtube.com/watch?v=YDly9qPpPYQ

TED. (2011). Eli Pariser: Beware online “filter bubbles.” http://www.ted.com/talks/eli_pariser_beware_online_filter_bubbles.html

Additional materials mentioned during the presentation:

On The Media. (2013). Ads vs. Ad-Blockers. http://www.onthemedia.org/2013/may/10/ads-vs-ad-blockers

  • Shared in response to a question about how modifying your web browser with extensions like ad blockers can have unintended consequences, such as hurting independent publishers.

This American Life. (2012). Forgive us our press passes. 468: Switcheroo. http://www.thisamericanlife.org/radio-archives/episode/468/switcheroo?act=2

  • Although we didn’t mention Journatic.com during our presentation, it’s another variation on the content farm: instead of using SEO techniques to attract web traffic from a general audience, Journatic.com works with newspapers to outsource hyper-local articles to writers abroad, who often publish under fake bylines.

Recommended Readings:

1) Content Farms

NOTE: Notice the use of SEO in the web address; the article is NOT about ESPN.

2) Algorithm-written Content

3) Crowdsourcing

Crowdsourcing Site Screenshots, by Beth Evans (Brooklyn): http://www.slideshare.net/myspacelibrarian/crowdsourcing-site-screenshots

Unexpected Authors and Their Impact on Scholarly Research

The LACUNY Emerging Technologies Committee, the LACUNY Scholarly Communications Roundtable, LILAC, and the Office of Library Services are delighted to announce our Spring program:

Computers and Crowds:
Unexpected Authors and Their Impact on Scholarly Research

Friday, May 17th; 9:30am – 12:30pm
Graduate School of Journalism, Room 308

Register online.

Please join us for an exciting half-day session that begins with an introduction to new content production models and ends with moderated breakout discussions of specific topics in the field.

Part 1:
Hats, farms, and bubbles: How emerging marketing & content production models are making research more difficult (and what you and your students can do about it)

Description:
Google and other search engines have made tremendous progress in organizing the world’s knowledge. However, accessing that knowledge is becoming increasingly difficult because of emerging marketing and content production models used by high-ranking sites like eHow.com and ExpertVillage.com. Search engine optimization (SEO), “content farms,” and Google’s increasingly personalized search algorithms are making search engines less effective as academic research tools; as a result, students are exposed to more shallow, low-quality results than ever before. In this session, learn more about the technologies behind these emerging marketing and content production models, and learn strategies faculty, students, and librarians can use to respond to the new information environment.

Speakers:
Kate Peterson
Information Literacy Librarian, University of Minnesota-Twin Cities

Paul Zenke
DesignLab/Digital Humanities Initiative Project Assistant, University of Wisconsin-Madison

Part 2:
Three concurrent breakout conversations on content farms, algorithm-written content, and crowdsourcing. Recommended readings will be made available in advance on the Academic Commons.

Refreshments will be served!