Password cracking and guessing are an inevitable pain for penetration testers. When conducting such assessments, besides good cracking software and a decent amount of GPU power, both of which definitely come in handy, what matters most is a good word list that narrows the search space as much as possible. That is definitely easier said than done, especially if your target comes from a non-English-speaking area.
Being an English-wise alien myself, I was wondering how I could generate a word list that could be used against targets from my area (the Western Balkans), and what properties have to be considered for such a quest. But first, let me explain some specifics of the language(s) we’ll be targeting.
Around 16 million people in the region speak more or less the same language (although each nation likes to call it by its own name, hence Balkanisation), with common grammatical and syntactical characteristics, and it can be written in both the Latin and Cyrillic alphabets. Based on the patterns from some recent database leaks, it is safe to assume that the overwhelming majority of people from the region still use the ASCII character set when choosing passwords. This should not come as a surprise, as we, poor human beings, have a tendency to make our lives easier, especially since, out of the 30 letters of the Latin alphabet we use, 25 are identical to letters of the English alphabet, and even the remaining 5 are expressed with English letters in everyday use:
| Local letter | ASCII substitute |
| --- | --- |
| Đ đ | D d |
| Č č | C c |
| Ž ž | Z z |
| Š š | S s |
| Ć ć | C c |
Therefore, in order to generate local-language word lists we will need:
- Sources – These can be popular (news) web sites in the local language, as they contain variations of words of different lengths,
- Crawler – Software that can efficiently fetch content from the source,
- Transformation – Software that can transform (convert) letters from the UTF-8 Latin Extended-A set to ASCII characters, and
- Sorting – Software that can sort and clean word lists.
Luckily, most Linux distributions and UNIX derivatives come with a set of POSIX-mandated utilities for transformation and sorting in the base system, such as sed, sort and uniq. That leaves us only with the crawler problem. Well, meet
CeWL
CeWL, the Custom Word List generator, is a Ruby-based crawler written by Robin Wood. Depending on the command-line arguments passed, it spiders the given URL to a specific link depth and returns the set of words that match the search criteria. It comes preloaded with major pen testing Linux distributions such as Kali Linux and ParrotOS, and it can also be easily installed on any other system by simply cloning its git repository.
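If you are not on one of those distributions, installing from source looks roughly like this (a minimal sketch, assuming the project’s usual repository location and Gemfile-based setup):

```bash
# Assumed upstream repository; Ruby and RubyGems must already be installed.
git clone https://github.com/digininja/CeWL.git
cd CeWL
gem install bundler      # if bundler is not already present
bundle install           # pulls in the gems CeWL depends on
ruby cewl.rb --help      # sanity check
```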
Here are the options:
./cewl.rb
CeWL 5.5.2 (Grouping) Robin Wood (robin@digi.ninja) (https://digi.ninja/)
Usage: cewl [OPTIONS] ... <url>
OPTIONS:
-h, --help: Show help.
-k, --keep: Keep the downloaded file.
-d <x>,--depth <x>: Depth to spider to, default 2.
-m, --min_word_length: Minimum word length, default 3.
-o, --offsite: Let the spider visit other sites.
-w, --write: Write the output to the file.
-u, --ua <agent>: User agent to send.
-n, --no-words: Don't output the wordlist.
-a, --meta: include meta data.
--meta_file file: Output file for meta data.
-e, --email: Include email addresses.
--email_file <file>: Output file for email addresses.
--meta-temp-dir <dir>: The temporary directory used by exiftool when parsing files, default /tmp.
-c, --count: Show the count for each word found.
-v, --verbose: Verbose.
--debug: Extra debug information.
Authentication
--auth_type: Digest or basic.
--auth_user: Authentication username.
--auth_pass: Authentication password.
Proxy Support
--proxy_host: Proxy host.
--proxy_port: Proxy port, default 8080.
--proxy_username: Username for proxy, if required.
--proxy_password: Password for proxy, if required.
Headers
--header, -H: In format name:value - can pass multiple.
<url>: The site to spider.
As we can see, there are plenty of parameters that can help us with our search. A word of advice: when crawling sites extensively, it is always a good idea to use a proxy and spoof your User-Agent request header (both supported by CeWL). Let’s try to spider our company’s web site. Although it is entirely in English, for the sake of demonstrating basic CeWL capabilities, we’re interested in words that are at least 8 (and at most 13) characters long, and we will go with a link depth of 2. Also, we would like to store the result in the results.txt file:
┌──(kali㉿kali)-[~/tools/wordlist]
└─$ cewl -d 2 -m 8 -x 13 -w results.txt https://www.sectreme.com
CeWL 6.1 (Max Length) Robin Wood (robin@digi.ninja) (https://digi.ninja/)
┌──(kali㉿kali)-[~/tools/wordlist]
└─$
wc -l < results.txt shows us that we got 655 words. Cool! Or better, CeWL!
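Coming back to the earlier word of advice: the same crawl can be routed through a proxy with a spoofed User-Agent using the options from the help output above (the proxy address and the User-Agent string below are placeholders, not a recommendation):

```bash
# Hypothetical example: crawl through a local proxy and present a browser-like User-Agent.
cewl -d 2 -m 8 -x 13 -w results.txt \
     --proxy_host 127.0.0.1 --proxy_port 8080 \
     -u "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
     https://www.sectreme.com
```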
Now, let’s try to spider one of the popular sites in the region: the Belgrade-based Politika newspaper, a CMS with a modest amount of text and links, with
cewl --lowercase -d 2 -m 8 -x 13 -w politika.txt https://www.politika.rs/sr
and after an hour on a 5 Mbit connection, we got a ~440 kB file with 33758 words. This is just to point out how resource-intensive CeWL can be, even with a modest amount of text, when following links 2 levels deep.
A quick look into the resulting dataset and we can immediately see that we have to do the second part of our job, called
Transformation
We already said that we’ll need some sort of conversion to match ASCII-style passwords based on the local language, and for that we’ll be using the powerful stream editor sed and its regex matching (by the way, sed just celebrated its 50th birthday – that is how long good tools last!). First, we have to find out what we are searching for and what we will replace it with. As previously mentioned, in order to convert those 5 Unicode multibyte characters to their ASCII equivalents, it would be a good idea to work out their hex representation. We can do that with
┌──(kali㉿kali)-[~/tools/wordlist]
└─$ echo "č" | hexdump -C
00000000 c4 8d 0a |...|
00000003
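To save a few keystrokes, the same check can be run for all five letters with a small loop (plain shell, shown here purely as a convenience sketch):

```bash
# Dump the UTF-8 byte sequence of each special letter (lowercase forms only,
# since we crawl with --lowercase anyway).
for ch in đ ž š ć č; do
    printf '%s: ' "$ch"
    printf '%s' "$ch" | hexdump -C | head -n 1
done
```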
So we have our first two-byte search-and-replace pattern, \xc4\x8d, for the letter č. With the byte sequences of the rest in hand, we can write the following sed expressions (note that both ć and č collapse to plain c, so they get separate substitutions):
| sed expression | Conversion |
| --- | --- |
| sed 's/\xc4\x91/d/g' | đ -> d |
| sed 's/\xc5\xbe/z/g' | ž -> z |
| sed 's/\xc4\x87/c/g' | ć -> c |
| sed 's/\xc4\x8d/c/g' | č -> c |
| sed 's/\xc5\xa1/s/g' | š -> s |
Chained together, the conversion looks like this:
cat politika.txt | sed 's/\xc4\x91/d/g' | sed 's/\xc5\xbe/z/g' | sed 's/\xc4\x87/c/g' | sed 's/\xc4\x8d/c/g' | sed 's/\xc5\xa1/s/g'
and while we’re there, why not pipe it further to sort it alphabetically, remove duplicates and find out how many words we’re left with after the transformation? Here’s the line doing just that:
cat politika.txt | sed 's/\xc4\x91/d/g' | sed 's/\xc5\xbe/z/g' | sed 's/\xc4\x87/c/g' | sed 's/\xc4\x8d/c/g' | sed 's/\xc5\xa1/s/g' | sort | uniq > politika.sorted.txt && wc -l < politika.sorted.txt
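As a side note, the separate sed processes can be merged into a single invocation by passing multiple -e expressions, which with GNU sed is functionally equivalent and a bit lighter on the pipeline:

```bash
# Same transformation in one sed process; sort -u replaces sort | uniq.
sed -e 's/\xc4\x91/d/g' \
    -e 's/\xc5\xbe/z/g' \
    -e 's/\xc4\x87/c/g' \
    -e 's/\xc4\x8d/c/g' \
    -e 's/\xc5\xa1/s/g' politika.txt | sort -u > politika.sorted.txt
```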
We got our first sorted and cleaned local word list of 30k+ words, ready for further use on the long quest for passwords with tools such as john, hydra and hashcat.
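Just to illustrate where such a list plugs in, here is a hypothetical example of running it against a file of unsalted MD5 hashes (hashes.txt is a placeholder; adjust the mode and format to whatever you are actually cracking):

```bash
# Straight (dictionary) attack with hashcat; -m 0 selects raw MD5, -a 0 the wordlist attack mode.
hashcat -a 0 -m 0 hashes.txt politika.sorted.txt

# The same list fed to John the Ripper.
john --wordlist=politika.sorted.txt --format=raw-md5 hashes.txt
```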
wlgen.sh
For those of you seeking automation, I wrote a small bash script called wlgen.sh that implements the steps above to produce clean, localized word lists. For web crawling and word fetching, wlgen.sh relies on CeWL and checks for its presence together with the other prerequisites (ruby and git) – if those are not installed, wlgen will install them together with the gems that CeWL requires. wlgen.sh will crawl the web site, fetch all the words from it (within the given min and max length parameters), convert all UTF-8 Latin Extended-A characters to their ASCII equivalents, remove duplicates, sort the list, and save it in a FQDN.wordlist file.
wlgen.sh is available as a GitHub repo.
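For reference, the core of that workflow squeezed into script form looks roughly like the sketch below. This is a simplified illustration of the steps described in this post, not the actual wlgen.sh; argument validation, dependency installation and error handling are left out:

```bash
#!/usr/bin/env bash
# Sketch: crawl a site with CeWL, transliterate to ASCII, sort and deduplicate.
# Usage: ./sketch.sh <url> <min_len> <max_len>
set -euo pipefail

url="$1"; min="$2"; max="$3"
fqdn="$(echo "$url" | sed -E 's#^https?://##; s#/.*$##')"   # derive FQDN for the output file name

cewl --lowercase -d 2 -m "$min" -x "$max" -w "$fqdn.raw" "$url"

sed -e 's/\xc4\x91/d/g' -e 's/\xc5\xbe/z/g' \
    -e 's/\xc4\x87/c/g' -e 's/\xc4\x8d/c/g' \
    -e 's/\xc5\xa1/s/g' "$fqdn.raw" | sort -u > "$fqdn.wordlist"

echo "$(wc -l < "$fqdn.wordlist") words written to $fqdn.wordlist"
```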