New dataset: "Erebus"
Added 2022-08-31 12:19:19 +0000 UTC
I have finally done it: I am currently cleaning and compiling a new dataset called "Erebus".
Some specifications:
Size: 4 GB (200k stories) (before cleaning)
Contains:
- Dataset G (Pixiv)
- Literotica (4.5 or higher)
- Sexstories (90% or higher)
- Pike (selected on "Adult" stories)
- Doc's Lab (90% or higher)
- SoFurry (mixture of various tags)
The dataset still needs to be pruned of all the short (<10 kB) stories and will be cleaned up using the same settings I used for Nerys-v2. Most likely multiple variations will be made (2.7B, 6B, 13B, 30B). I am debating whether I should train on top of Nerys-v2 or just on the base model.
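The pruning step described above could be sketched roughly as follows. This is only an illustration, not the actual cleaning script: the function name `prune_short_stories`, the dictionary layout, and reading "10 kB" as 10,000 bytes of UTF-8 text are all assumptions.

```python
def prune_short_stories(stories, min_bytes=10_000):
    """Keep only stories whose UTF-8 encoding is at least min_bytes long.

    `stories` maps a story name to its text. The 10,000-byte default is an
    assumed reading of the "<10 kB" cut-off; use 10 * 1024 if kibibytes
    were meant.
    """
    return {
        name: text
        for name, text in stories.items()
        if len(text.encode("utf-8")) >= min_bytes
    }


if __name__ == "__main__":
    # Hypothetical toy data: one story below the cut-off, one at the cut-off.
    sample = {"short": "a" * 9_999, "long": "b" * 10_000}
    kept = prune_short_stories(sample)
    print(sorted(kept))
```

Measuring the encoded byte length rather than the character count keeps the cut-off comparable to file sizes on disk, which matters for text with many non-ASCII characters.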
The 30B version will most likely be privately hosted for those that would like to spread the word using the KoboldAI-cluster. Please note that since I am eating the cost of running the server, it will most likely be available for patrons and supporters, and I won't be running it 24/7.
On a related matter, I am currently having issues with Runpod being hoarded by GPU miners. Although I am partly responsible for this (sorry), I have to say that the income I get from it does pay most of the bills. I hope that shortly it will even be able to pay for a KoboldAI instance in full. But to get to that point, I need to do some legwork.
Comments
Hello Mr. Seeker, I tried out the Erebus model and I think it's quite good. I was not really impressed with the Shinen 13B model, so I was still using Shinen 6B when I wanted to do NSFW stuff. As for the Shinen 13B model, I found it not really responsive. It would follow what I input quite well, but it was rather bland if you just tried to let it tell a story, even at high temperature. So far, though, I am very impressed with Erebus. I am finding that it starts to loop and repeat very quickly, although this could be due to my own prompts and experimenting with the settings. Thank you for this hard work!
Tony V
2022-10-02 17:58:54 +0000 UTC
That's a great idea, but I think I need to know what the cut-off metric should be (https://pypi.org/project/py-readability-metrics/)
Julius
2022-08-31 13:05:53 +0000 UTC
What do you think about further limiting the dataset based on something like Flesch-Kincaid Grade Level? There are a lot of texts out there that are considered "good" rating-wise but are very simplistic, which could lead to simplistic output for those of us who can't write well. Something like textstat in Python could give a lot of grade-level stats.
ebolam
2022-08-31 12:50:37 +0000 UTC
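The readability filter suggested in the comments could look something like the sketch below. It computes the standard Flesch-Kincaid Grade Level formula in plain Python with a rough vowel-group syllable heuristic, so its scores will differ somewhat from those of textstat or py-readability-metrics. The function names and the `min_grade` threshold are hypothetical; the thread leaves the actual cut-off undecided.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def filter_by_grade(stories, min_grade):
    """Keep stories at or above min_grade (the threshold is an assumption)."""
    return [s for s in stories if flesch_kincaid_grade(s) >= min_grade]


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = ("Notwithstanding considerable institutional opposition, the "
             "administration implemented comprehensive environmental regulations.")
    for story in (simple, dense):
        print(round(flesch_kincaid_grade(story), 2))
```

A sensible cut-off would probably have to be picked by scoring a sample of the dataset and checking which stories fall on either side of it.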