free culture – the Free-Culture Society for Open-source Judaism

November 26, 2010

תנ״ך | A Tale of Two Codexes: The Aleppo and Leningrad Codex

Given that more than 50% of the Siddur is composed of text from the תנ״ך (TaNaKh), any project that seeks to rigorously attribute its sources depends on a critical, digital edition of the Masoretic text of the Hebrew Bible. Such is the case for our Open Siddur Project. The entire history of the transmission of such a profoundly important source text illustrates the degree to which we rely on each other’s most positive intentions to advance our love of the Torah through sharing, regardless of sect, creed, or scholarly or theological inspiration. Moving ahead, we are supported by each other’s gifts and by the preserved legacy of our cultural inheritance.

The oldest complete manuscript of the TaNaKh is the Leningrad Codex (circa 1008 CE) prepared by the school of Aharon ben Moshe ben Asher. The grand project of the Masoretes during the first millennium CE was preparing the text of the TaNaKh with their received tradition (masorah) of its enunciation and vocalization. Since these important oral traditions are not transcribed within Torah scrolls, the Masoretes preserved them by writing out the complete text of the TaNaKh with vowels (nikkud) and cantillation marks (trope). The Tiberian system for marking vowels in the Leningrad Codex is the same system used in Hebrew today.

According to modern scholars, Aharon ben Moshe ben Asher followed the Karaite rather than the Rabbinic tradition of Judaism. This may help explain why Aharon ben Asher’s contemporary, Rav Saadia Gaon (892-942 CE) preferred the codexes of another Masoretic school — that of Ben-Naphtali. However, only the codexes of the Ben-Asher school survived, and ultimately, the codexes of the Ben-Asher school were approved by Maimonides (1135-1204 CE). In his Yad ha-Ḥazakah, Maimonides writes:

“All relied on it, since it was corrected by Ben-Asher and was worked on and analyzed by him for many years, and was proofread many times in accordance with the masorah, and I based myself on this manuscript in the Sefer Torah that I wrote.”1

This approval is all the more astounding considering Maimonides’ outstanding objections to, and disputations with, the Karaites of his day.

In the 1830s, Abraham ben Samuel Firkovich, a manuscript collector and ḥakham of the Crimean Karaite Jewish community, visited Constantinople, Jerusalem, and the Cairo Genizah in Egypt. During these travels he acquired the Leningrad Codex, which was taken to Odessa in 1838 and later transferred to the Imperial Library in St. Petersburg. Used as the source text for the Biblia Hebraica in 1937 and the Biblia Hebraica Stuttgartensia in 1977, the Leningrad Codex was digitized in the 1980s as a collaborative scholarly project organized by the Presbyterian Westminster Theological Seminary’s J. Alan Groves Center for Advanced Biblical Research.

This text began as an electronic transcription by Richard Whitaker (Princeton Seminary, New Jersey) and H. van Parunak (then at the University of Michigan, Ann Arbor) of the 1983 printed edition of Biblia Hebraica Stuttgartensia (BHS). It was continued with the cooperation of Robert Kraft (University of Pennsylvania) and Emmanuel Tov (Hebrew University, Jerusalem), and completed by Prof. Alan Groves. The transcription was called the Michigan-Claremont-Westminster Electronic Hebrew Bible and was archived at the Oxford Text Archive (OTA) in 1987. It has been variously known as the “CCAT” or “eBHS” text. Since that time, the text has been modified in many hundreds of places to conform to the photo-facsimile of the Leningrad Codex, Firkovich B19A, residing at the Russian National Library, St. Petersburg; hence the change of name. The Groves Center has continued to scrutinize and correct this electronic text as a part of its continuing work of building morphology and syntax databases of the Hebrew Bible, since correct linguistic analysis requires an accurate text.2

The Groves Center decided to share the digital Westminster Leningrad Codex without restriction, a prescient and important decision made prior to the popularization of the Internet and the World Wide Web. Their altruistic decision continues to enable many innovative projects based on the text and study of the TaNaKh. The copy of the Westminster Leningrad Codex that we are using for the Open Siddur Project was derived from files shared by Christopher Kimball at tanach.us. The Internet Sacred Text Archive also provides the full Westminster Leningrad Codex (with transliteration).

This text is derived from the Westminster Leningrad Codex (WLC) of the Westminster Hebrew Institute. Thanks to Christopher V. Kimball, who graciously made the source files for this freely available. This version is based on the October 20th, 2006 WLC release.3

The tragic story of the older but unfortunately incomplete Aleppo Codex (circa 10th century CE), the codex on which the Leningrad Codex was first based and against which it was corrected, provides a cautionary lesson in contrast. Like the Leningrad Codex, the Aleppo Codex was preserved by Karaite Jews. It was then stolen by Crusaders, ransomed, and later transferred to the Syrian Aleppo community, where it was hidden for six centuries and zealously guarded. While the Leningrad Codex was copied and shared at the onset of the Age of Photography, the opportunity to copy and thereby preserve the Aleppo Codex was lost.

…the [Aleppo Jewish] community limited direct observation of the manuscript by outsiders, especially by scholars in modern times. Paul Kahle, when revising the text of the Biblia Hebraica in the 1920s, tried and failed to obtain a photographic copy. This forced him to use the Leningrad Codex instead for the third edition, which appeared in 1937.4

In the immediate aftermath of a deadly riot against Jews and Jewish property in Aleppo in December 1947, much of the five books — the Torah section of the Aleppo Codex — disappeared.

Today, at the onset of the Digital Age, we must preserve the heritage of our culture’s creative inspiration by digitizing our collective work in open standard formats and sharing it so that it can easily be mirrored and redistributed. The Open Siddur Project is committed to preserving the legacy of our diverse communities’ creative inspirations and calls upon all those who love the Torah and earnest spiritual practice to serve their intentions by sharing their intellectual resources.

If you represent an educational institution with copies of work in the public domain, please share digital images or digital transcriptions of this work with public domain declarations such as the Creative Commons Zero Public Domain declaration. For the preservation of our living tradition, the many surviving historic manuscripts witnessing variations of tefillot found in the Siddur, including the oldest surviving manuscripts of the Hebrew Bible, Dead Sea Scroll fragments, Jewish apocryphal and pseudepigraphal texts, Cairo Genizah fragments, and the various girsot of the Talmud, need to be made freely available for redistribution.

November 17, 2010

Openness, remixability, and free culture (Efraim Feinstein, 2010)

Russel Neiss writes, “while we have had many illuminating conversations since our presentation [at the JFNA General Assembly], the questions and feedback we have received overwhelmingly surrounds the first value of “Open, Discoverable and Accessible.”” He refers to the four core principles he articulated for Jewish educational material online, namely that it should be:

1. Open, Discoverable and Accessible;
2. Remixable;
3. Meaningful and Relevant; and
4. Community Building.

In the secular free culture world, the language is somewhat different, and the difference in emphasis can be illuminating. There, another set of four freedoms has been defined as the bedrock of the movement. To be a free culture work, a work must give its user:

1. the freedom to use the work and enjoy the benefits of using it;
2. the freedom to study the work and to apply knowledge acquired from it;
3. the freedom to make and redistribute copies, in whole or in part, of the information or expression; and
4. the freedom to make changes and improvements, and to distribute derivative works.

Freedoms 1 and 2 roughly correspond to Russel’s point number 1. Freedoms 3 and 4 encompass point number 2.

What is perhaps most instructive is that the values of free culture are not defined with respect to the material itself, nor to its content. They are freedoms guaranteed to the user. Material being “open, discoverable, and accessible” is a first step. Simply putting it on the Internet and having it indexed by search engines will satisfy this condition.

In the bargain of openness, content creators will have to choose to give up some exclusive rights. In exchange, the work gains a life of its own in the hands of the users, the educators and the students. In my (limited) experience of conversation with content providers, giving up those rights seems to be the greatest barrier to freeing educational works that are already made available.

Perhaps remixability is a harder sell to educators and educational content providers than openness because the advantages it provides are further from the originator. Content providers may argue that providing rights to copy material for “personal” or “educational” use satisfies their duty. However, the ability to make and distribute copies solely for limited use leads only to dissemination of the material. It does not result in an active culture being developed out of it. It does not result in improvements to the original, or in adaptations for circumstances different from those the original creator envisioned. Even if those adaptations are made locally, they will ultimately go undisseminated, potentially resulting in duplication of labor or, worse, their loss to future creators and users. The absence of remixing rights builds a one-way community of consumers instead of a multidirectional cooperative community of creators.

There is also the persistent fear of “misuse” of a work. If an author gives up exclusive control over remixes, how does he/she know that the results will still be ideologically compatible with the original? This is again a trade-off necessary for ensuring that users’ creativity can be exercised. Perceived damage to a creator’s reputation from an ideologically differing work can be mitigated by requiring that a modified work bear a notice that it was modified from its original version, and that no endorsement of the modified version by the original author is implied. Further, a web link to the original version may be included as part of the attribution. All Creative Commons free culture licenses (aside from CC0) bear these requirements. Overall, the benefits to the wider culture obtained from many creative minds working on the material outweigh the threats from “misuse.” The choice is between static read-only content and dynamic conversation among the user-creator partners.

Advocacy for creative works’ freedom represents a paradigm shift in thought among content creators: in a free culture, a premium is not placed on the material as such or even on the particular rights associated with the material. Instead, it is placed on the users’ freedom, and it is that freedom that is the prerequisite to large-scale creative engagement with educational material.

February 9, 2010

An Economic Argument for Open Data by Efraim Feinstein

There are two principles on which the success of data on the contemporary web rests: the web makes content available, and it adds value to that content by linking it to other related information.

When considering bringing old content online, both of these aspects are important. A first level of digitization involves simply making data available. Google Books and Hebrewbooks.org work at this level, providing PDFs and/or OCR-ed transcriptions of the material. A second level of digitization involves semantic linkage of the data, both internal to the site and external to the site. The Open Siddur Project and Open Scriptures digitize at the semantic level. This second-level digitization is required to do all of the cool things we expect to be able to do with online texts: click on a word and find its definition or grammatical form, find the source of a passage in one text in another text, find how the text has evolved historically, etc. Even the simplest form of a link, a reference from another site, requires some kind of internal division of the text.
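
To make the idea of internal division concrete, here is a minimal sketch in Python. It is not how the Open Siddur Project or Open Scriptures actually store their data; the names `Ref`, `parse_ref`, `TANAKH`, `SIDDUR_SOURCES`, `source_text`, and `word_at` are all hypothetical, and the verses are given as unpointed consonantal text for simplicity. It only shows that once a text is divided into addressable units, a passage in one text can point to its source in another, and a single word can be addressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An address for one verse: the internal division that makes linking possible."""
    book: str
    chapter: int
    verse: int


def parse_ref(text: str) -> Ref:
    """Parse a citation such as 'Deuteronomy 6:4' or '1 Samuel 3:4' into a Ref."""
    book, _, location = text.rpartition(" ")
    chapter, verse = location.split(":")
    return Ref(book, int(chapter), int(verse))


# Hypothetical base text, keyed by address (consonantal text only).
TANAKH = {
    Ref("Genesis", 1, 1): "בראשית ברא אלהים את השמים ואת הארץ",
    Ref("Deuteronomy", 6, 4): "שמע ישראל יהוה אלהינו יהוה אחד",
}

# Hypothetical semantic links: a siddur passage id points at its biblical source.
SIDDUR_SOURCES = {"shema-first-line": "Deuteronomy 6:4"}


def source_text(passage_id: str) -> str:
    """Follow the link from a siddur passage to the verse it quotes."""
    return TANAKH[parse_ref(SIDDUR_SOURCES[passage_id])]


def word_at(ref: Ref, position: int) -> str:
    """Address a single word (1-based), as a 'click on a word' feature would need."""
    return TANAKH[ref].split()[position - 1]
```

A scanned page image offers none of these handles; the addresses exist only because someone divided the text and recorded the links.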

Digitization that takes advantage of the web therefore requires a number of steps: (1) getting the basic text online, (2) getting it in an addressable form (to make it more like typed text, instead of a picture of a page), (3) assuring the text’s accuracy, and (4) marking it up for semantic linkage. Some of these steps, or parts of them, can be done automatically, but overall they require some degree of intelligent input. Even step 1, which is primarily mechanical in nature, requires design of the procedures.
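
As an illustration of how part of step 3 might be automated while still leaving the decision to a person, the sketch below compares two independent transcriptions of the same passage and lists only the places where they disagree. This is one possible approach, not a description of any project's actual workflow; the function name `disagreements` and the sample strings are made up for the example. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib


def disagreements(first: str, second: str):
    """Return the word-level spans where two transcriptions differ.

    Each item is (kind, words_in_first, words_in_second), where kind is
    'replace', 'delete', or 'insert'. Matching spans are omitted, so a
    human proofreader only has to look at the flagged places.
    """
    words_a, words_b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (kind, words_a[a_start:a_end], words_b[b_start:b_end])
        for kind, a_start, a_end, b_start, b_end in matcher.get_opcodes()
        if kind != "equal"
    ]


# Hypothetical example: the second transcriber misspelled one word.
flagged = disagreements(
    "in the beginning God created",
    "in the begining God created",
)
```

The program can say where the two versions differ, but not which one is right; that still takes someone checking the source.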

I hope that this outline of the steps required to get a text online suggests that the most expensive part of making content available is human labor: it takes time to do it, and it takes even more time to do it right.

And now for the rhetorical questions:

How many times has the Tanach been digitized?
… the siddur?
… the Talmud?
… major commentaries on the siddur, Torah, Talmud (Rashi, Tosefot)?
… full codes of Jewish law (Mishneh Torah, Tur, Shulchan Aruch, Aruch Hashulchan)?
… uncommon piyyutim (liturgical poems)?

In some cases, the answer is: it’s been done many times. In other cases, the answer is: it’s never been done. And both answers lead to the all-important question: why? Why are there so many digitizations of the Tanach and no full digitizations of Shulchan Aruch online? Why isn’t the siddur already hyperlinked to its Talmudic sources?

I would propose that we have been wasteful with our resources. Earlier, I pointed out that the primary resources that go into these advanced digitizations are time and human labor. In some cases, these resources equate directly to money; in others, the linkage is more indirect.

The core material of all of the above-mentioned works comes from the public domain. It is ownerless, and free for anyone to copy for any purpose. Every time we encounter a basic text that we have to digitize again because of “new copyright” claims or EULA-style contractual constraints, that is an indication of a failure somewhere in the system. This is particularly true if the claims are being made by non-profits, “social” businesses, or academic institutions. In the Jewish world, even for-profit published books are sometimes donation-supported. Each common text that has to be digitized a second, third, or hundredth time equates to another less common text that is not being digitized. Redoing basic OCR work and transcription takes resources away from establishing semantic linkages.

Some people and organizations get it. As of now, we only need one digitization of the Leningrad Codex (Masoretic Bible). That’s because Christopher Kimball and the J. Alan Groves Center for Advanced Biblical Research digitized it, transcribed it, and released it as free data. The Westminster Leningrad Codex is now perhaps the most built-upon version of the Hebrew Bible online. The base texts (which may be used “without restriction”) are present in both commercial and non-commercial products. The Open Siddur Project is using it both for its technology demonstrations and as the basis of all biblical texts in the siddur.

There are precious few examples of free data in the Jewish community, even on the Internet. There are copious examples of donation-funded organizations presenting primarily public domain data with new copyright claims.

Free data prevents the necessity of duplication of effort, which, in turn, prevents the community as a whole from unnecessarily wasteful spending. Particularly for organizations with a social mission, its use is a win for everyone.