Very cool. <br><br>What about context-dependent characters? For example, when tokenizing English, a contraction apostrophe is actually part of the word, but a closing single quote is not. Would this be up to the parser to disambiguate?<br>
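<br>
To make the question concrete, here is a rough sketch of how a pass after the tokenizer might handle the easy cases. It assumes Python 3 and plain str input; the "apostrophe" and "quote" class names and the merge_contractions helper are hypothetical, not part of the tokenizer quoted below.<br>
<br>
```python
from itertools import groupby

# Hypothetical character classes for illustration. The apostrophe gets its
# own class so that a later pass can decide what it means in context.
CHR_CLASSES = {
    "letters": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "whitespace": " \n",
    "punctuation": ".,;",
    "apostrophe": "'",
}


def tokenize(text, chr_classes):
    """Group consecutive characters of the same class into (class, string) pairs."""
    idx = {ch: label for label, chrs in chr_classes.items() for ch in chrs}
    return [(label, "".join(chars)) for label, chars in groupby(text, idx.get)]


def merge_contractions(tokens):
    """Fold letters + one apostrophe + letters into a single "letters" token.

    Any other apostrophe token is relabelled "quote".
    """
    out = []
    pending = list(tokens)
    while pending:
        label, text = pending.pop(0)
        prev_label = out[-1][0] if out else None
        next_label = pending[0][0] if pending else None
        if (label == "apostrophe" and text == "'"
                and prev_label == "letters" and next_label == "letters"):
            # Apostrophe sandwiched between letters: treat it as word-internal.
            _, next_text = pending.pop(0)
            out[-1] = ("letters", out[-1][1] + text + next_text)
        elif label == "apostrophe":
            out.append(("quote", text))
        else:
            out.append((label, text))
    return out


tokens = merge_contractions(tokenize("don't say 'hi'", CHR_CLASSES))
```
<br>
This only covers apostrophes with letters on both sides. A trailing apostrophe (a plural possessive versus a closing quote) or a leading one cannot be told apart from the neighbouring character classes alone, so those cases would presumably still fall to the parser.<br>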
<br>Weston<br><br><div class="gmail_quote">On Mon, Apr 5, 2010 at 1:29 PM, James Tauber <span dir="ltr"><<a href="mailto:jtauber@jtauber.com">jtauber@jtauber.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
As it happens, I was working on a tokenizer last night (was the reason for this tweet: <a href="http://twitter.com/jtauber/status/11611451291" target="_blank">http://twitter.com/jtauber/status/11611451291</a> ). I had planned to put it up on GitHub (as part of a general "text tools" project) but here is the current state:<br>
<br>
<br>
The core of it is just a two-line Python function:<br>
<br>
--------------------<br>
<br>
from itertools import groupby<br>
<br>
def tokenize(stream, chr_classes):<br>
    """<br>
    tokenize the given stream based on the given character classes.<br>
    <br>
    chr_classes should be a dictionary mapping character class label to a<br>
    string of member unicode characters::<br>
        {<br>
            "numbers": u"0123456789",<br>
            "whitespace": u" \n",<br>
        }<br>
    """<br>
    # build reverse index from character to character class<br>
    idx = dict((ch, chr_class) for chr_class, chrs in chr_classes.items() for ch in chrs)<br>
    <br>
    # tokenize text<br>
    return groupby((ch for line in stream for ch in line.decode("utf-8")), idx.get)<br>
<br>
--------------------<br>
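<br>
For readers on Python 3, the same groupby technique can be sketched roughly as follows. This is an adaptation under assumptions, not the code above: the name tokenize_py3 is made up, the stream is assumed to yield already-decoded str lines, and each group is joined into a string instead of being returned as a lazy iterator.<br>
<br>
```python
from itertools import groupby


def tokenize_py3(stream, chr_classes):
    """Yield (class label, token string) pairs from an iterable of str lines."""
    # Reverse index from character to character class label.
    idx = {ch: label for label, chrs in chr_classes.items() for ch in chrs}
    # Flatten the lines into one character stream, so a token may span lines.
    chars = (ch for line in stream for ch in line)
    # groupby starts a new group whenever idx.get(ch) changes; characters
    # that belong to no class all share the key None.
    for label, group in groupby(chars, idx.get):
        yield label, "".join(group)


classes = {"numbers": "0123456789", "whitespace": " \n"}
result = list(tokenize_py3(["12 34\n", "x5\n"], classes))
```
<br>
Because idx.get returns None for unknown characters, anything outside the declared classes comes back as a token labelled None, which is a cheap way to spot characters that still need classifying.<br>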
<br>
So, for example, you set up your character classes (yes, I could have just defined ranges but, in my code last night, I was being explicit about the characters appearing in the particular text I was tokenizing):<br>
<br>
<br>
--------------------<br>
<br>
def u(s):<br>
    """<br>
    convert utf-8 encoded string to unicode.<br>
    """<br>
    return s.decode("utf-8")<br>
<br>
<br>
CHR_CLASSES = {<br>
"editorial": u("[]‹›()"),<br>
"letters": u(<br>
"ΒΓΔΖΘΚΛΜΝΞΠΣΤΦΧΨ" "ΑΕΗΟ" "Ῥ"<br>
"ἈἊἉἌἍἙἘἝἜἚἩἨἬἭἹἸἽἼὍὉὊὈὌὙὩὭὫὨ"<br>
"βγδζθκλμνξπρσςτφχψ" "αεηιουω" "ῥῤ"<br>
"ἀἁάὰᾶἄἅἂᾳᾷἃἆᾴᾄ" "ἐἑέὲἔἕἓ" "ἡήὴῆἢἤῃῄῇἥἦἧᾐἠᾖἣᾗ"<br>
"ἰἱίὶῖἶἷἴἵϊἳΐῒ" "ὀὁόὸὅὃὄὂ" "ὐὑύὺῦὔὖὕὓ"<br>
"ὡώὼῶὧὥῳᾠῷῴᾧὦὤὠὢᾡὣ"<br>
"΄"<br>
),<br>
"whitespace": u(" \n"),<br>
"numbers": u("1234567890"),<br>
"punctuation": u(".,·;“”"),<br>
"temp": u("†-"),<br>
}<br>
<br>
--------------------<br>
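<br>
The range-based alternative mentioned in passing above might look something like this in Python 3. The helper names chars_in_ranges and letters_in_ranges are hypothetical, and filtering by Unicode general category is an assumption about how one would keep punctuation and unassigned code points out of a block-wide range.<br>
<br>
```python
import unicodedata


def chars_in_ranges(*ranges):
    """Return every character in the given inclusive (first, last) ranges."""
    return "".join(
        chr(cp)
        for first, last in ranges
        for cp in range(ord(first), ord(last) + 1)
    )


def letters_in_ranges(*ranges):
    """Like chars_in_ranges, but keep only code points categorised as letters."""
    return "".join(
        ch for ch in chars_in_ranges(*ranges)
        if unicodedata.category(ch).startswith("L")
    )


RANGE_CLASSES = {
    "numbers": chars_in_ranges(("0", "9")),
    # Greek and Coptic (U+0370..U+03FF) plus Greek Extended (U+1F00..U+1FFF).
    "letters": letters_in_ranges(("\u0370", "\u03ff"), ("\u1f00", "\u1fff")),
    "whitespace": " \n",
}
```
<br>
The trade-off is the one hinted at above: ranges are shorter to write but accept characters that never occur in the text, whereas an explicit list doubles as a record of exactly what was seen.<br>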
<br>
and then you're good to go...<br>
<br>
<br>
--------------------<br>
<br>
import sys<br>
FILENAME = sys.argv[1]<br>
<br>
for chr_class, token in tokenize(open(FILENAME), CHR_CLASSES):<br>
    print "".join(token).encode("utf-8"), chr_class<br>
<br>
--------------------<br>
<br>
<br>
James<br>
<div><div></div><div class="h5"><br>
<br>
On Apr 5, 2010, at 1:44 PM, Weston Ruter wrote:<br>
<br>
> DM:<br>
><br>
> But what we really need is not a parser but a tokenizer. I'm thinking about writing one (my degree work was in compiler writing). Basically, we repeat the same tokenization code in several places. It should be trivial to write a complete, accurate one.<br>
><br>
> I've also been wanting to work on a tokenizer. At Open Scriptures, the text of a work is currently represented by two models (database tables): Token and Structure. Tokens are the smallest divisible units of text, such as words, punctuation, and whitespace; and structures are the spans of tokens that form logical units, such as verses, paragraphs, quotes, etc. The structures are standoff markup for the tokens. With the underlying data stored in this way, it can then be serialized in whichever hierarchy is desired (book-section-paragraph, book-chapter-verse, all-milestoned, etc.) or whichever data format is needed (OSIS, SWORD Module, XHTML, etc.).<br>
><br>
> So what I'm currently ruminating on is the process of importing the raw data into the Token and Structure models. I wrote an importer for the Tischendorf GNT data which does everything, both tokenizing and parsing, but obviously a lot of that code will be common to other importers that are written. So I too am thinking about how these importers can be reduced to the bare minimum needed to handle the unique aspects of the raw data (i.e. normalize it), and then stream the tokens back to a central importer that parses the input and stores it into the Token and Structure models. This central importer facility could be a web service.<br>
><br>
> I'd love to collaborate with you on this. We could come up with a common tokenizer that can be used by both SWORD and Open Scriptures. The importer web service could take tokens as input and as output generate a SWORD module and also populate the Open Scriptures models at the same time.<br>
><br>
> Thoughts?<br>
><br>
> Weston<br>
><br>
><br>
><br>
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <<a href="mailto:dhowens@pmbx.net">dhowens@pmbx.net</a>> wrote:<br>
> Yes, I agree, and if there were a feedback mechanism for the module creator to let them know how to start fixing an OSIS file or conf file, it would save Chris (or whoever else approves modules) time on the basic stuff.<br>
><br>
> Daniel<br>
><br>
><br>
> On 4/5/2010 11:09 AM, DM Smith wrote:<br>
> This is a great idea. Rather than emailing source to modules at crosswire dot org, one could upload it via a web service. We could have stages of validation (xmllint) and construction (osis2mod). Such a service could evaluate the quality of the submission.<br>
><br>
> In Him,<br>
> DM<br>
><br>
> On 04/05/2010 12:01 PM, Weston Ruter wrote:<br>
> Why not turn osis2mod into a web service? Then it wouldn't matter how it is implemented since it would be abstracted away by the web service interface. It could use the best XML libraries available today and be written in the programming language of choice, both of which would make maintenance and the addition of new features much easier.<br>
><br>
> Weston<br>
><br>
><br>
><br>
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <<a href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>> wrote:<br>
> On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:<br>
> On 5 April 2010 13:55, Manfred Bergmann<<a href="mailto:manfred.bergmann@me.com">manfred.bergmann@me.com</a>> wrote:<br>
><br>
> Hi DM.<br>
><br>
> Am 05.04.2010 um 13:21 schrieb DM Smith:<br>
><br>
><br>
> Regarding using a "real" parser, it is a good idea. But we don't want SWORD to be dependent on an external parser.<br>
><br>
> What's the reason for that?<br>
> I could understand if it meant the user had to install certain libraries manually, but when the sources can be integrated into the project and have the appropriate licence, then why not?<br>
><br>
><br>
> Manfred<br>
><br>
><br>
> IMHO there is no harm in bringing in libxml or a much more lightweight<br>
> parser like GMarkup. The build system just needs to be adjusted to<br>
> link e.g. libxml into the osis2mod binary and not the shared sword library.<br>
> It could even be a new tool, called osisxml2mod for example, that is<br>
> built optionally, so that you can still have a full sword dev<br>
> environment without libxml.<br>
><br>
> Tools for creating modules do not have to be linked with sword or even<br>
> live in the sword tarball / svn, although that does help consistent<br>
> distribution of tools.<br>
><br>
> I don't remember all of Troy's reasoning when I argued for a true parser.<br>
><br>
> From what I recall:<br>
> o To maintain freedom to re-license SWORD (e.g. for some other Bible society) we need to keep third-party library dependencies well managed. The license needs to be compatible with the GPL but cannot be GPL.<br>
><br>
> o The parser that we have is minimal and simple, sacrificing accuracy and completeness for speed. Regarding accuracy, e.g. the parser allows for spaces around = in attribute declarations. Regarding completeness, e.g. it does not handle namespaces, CDATA, DTDs/schemas, .... Significantly, it does not require a well-formed document, allowing for fragments. Rather than raising an error, it continues where an XML parser is required to stop.<br>
><br>
> o This parser has better error reporting in that it is based upon knowledge of the input. E.g. it reports the verse having the problem.<br>
><br>
> o By SWORD having the parser, we are not dependent on finding an implementation for every platform (e.g. Windows).<br>
><br>
> There may be other reasons. I'm willing to live with it.<br>
><br>
> But what we really need is not a parser but a tokenizer. I'm thinking about writing one (my degree work was in compiler writing). Basically, we repeat the same tokenization code in several places. It should be trivial to write a complete, accurate one.<br>
><br>
> In His Service,<br>
> DM<br>
><br>
><br>
> _______________________________________________<br>
> sword-devel mailing list: <a href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a><br>
> <a href="http://www.crosswire.org/mailman/listinfo/sword-devel" target="_blank">http://www.crosswire.org/mailman/listinfo/sword-devel</a><br>
> Instructions to unsubscribe/change your settings at above page<br>
><br>
><br>
</div></div>> --<br>
> You received this message because you are subscribed to the Google Groups "Open Scriptures" group.<br>
> To post to this group, send email to <a href="mailto:open-scriptures@googlegroups.com">open-scriptures@googlegroups.com</a>.<br>
> To unsubscribe from this group, send email to <a href="mailto:open-scriptures%2Bunsubscribe@googlegroups.com">open-scriptures+unsubscribe@googlegroups.com</a>.<br>
> For more options, visit this group at <a href="http://groups.google.com/group/open-scriptures?hl=en" target="_blank">http://groups.google.com/group/open-scriptures?hl=en</a>.<br>
<font color="#888888"><br>
<br>
</font></blockquote></div><br>